9 datasets found
  1. databricks-dolly-15k

    • huggingface.co
    Cite
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Databricks (http://databricks.com/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  2. Dolly-V2-7B

    • kaggle.com
    Updated Aug 8, 2023
    Cite
    hyc (2023). Dolly-V2-7B [Dataset]. https://www.kaggle.com/datasets/hycloud/dolly-v2-7b/code
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    hyc
    Description

    Dataset

    This dataset was created by hyc

    Contents

  3. dolly-15k

    • huggingface.co
    Cite
    Ritesh Khanna, dolly-15k [Dataset]. https://huggingface.co/datasets/treadon/dolly-15k
    Authors
    Ritesh Khanna
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset Card for "dolly-15k"

      Summary
    

    This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.

      Special thanks to ❤️ Databricks for creating and making this set available.
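
    The 99% / 1% split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the method the dataset author used: the helper name `split_records`, the seed, and the placeholder records (including their field names) are assumptions made for the example.

```python
import random


def split_records(records, validation_fraction=0.01, seed=0):
    """Shuffle a copy of `records` and return (train, validation) lists."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record so the validation set is never empty.
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical stand-in records; the field names are placeholders.
records = [
    {"instruction": f"question {i}", "response": f"answer {i}"}
    for i in range(1000)
]
train, validation = split_records(records)
print(len(train), len(validation))
```

    Shuffling before slicing avoids holding out only the tail of the file, which matters if the records are grouped by category.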
    

    More Information needed

  4. databricks-dolly-15k.jsonl

    • kaggle.com
    Updated Mar 28, 2024
    Cite
    Matt M. (2024). databricks-dolly-15k.jsonl [Dataset]. https://www.kaggle.com/mattm2/databricks-dolly-15k-jsonl/discussion
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Matt M.
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Matt M.

    Released under Apache 2.0

    Contents

  5. bsc-dolly-15k-en

    • huggingface.co
    Updated Aug 31, 2000
    Cite
    Language Technologies Unit @ Barcelona Supercomputing Center (2000). bsc-dolly-15k-en [Dataset]. https://huggingface.co/datasets/BSC-LT/bsc-dolly-15k-en
    Dataset updated
    Aug 31, 2000
    Dataset provided by
    Barcelona Supercomputing Center (https://www.bsc.es/)
    Authors
    Language Technologies Unit @ Barcelona Supercomputing Center
    Description

    BSC Dolly 15k EN

    Reviewed version from the Argilla Dolly v2 English version, originally created by Databricks. We provide two subsets: "annotated", where some instances were labelled with potential problems; and "filtered", which only contains the instances without the issues that we observed.

      Annotation process
    

    While analysing the Argilla Dolly v2 English version, we observed the following:

      Task classification:
    
  6. mix-instruct

    • huggingface.co
    • opendatalab.com
    Updated Nov 13, 2024
    Cite
    LLM Blender (2024). mix-instruct [Dataset]. https://huggingface.co/datasets/llm-blender/mix-instruct
    Dataset updated
    Nov 13, 2024
    Dataset authored and provided by
    LLM Blender
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MixInstruct

      Introduction
    

    This is the official release of the MixInstruct dataset for the LLM-Blender project. The dataset contains 11 responses from currently popular instruction-following LLMs, including:

    Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan-T5, ChatGLM, MOSS, and Mosaic MPT.

    We evaluate each response with automatic metrics, including BLEU, ROUGE, BERTScore, and BARTScore, and provide pairwise comparison results by prompting ChatGPT for the… See the full description on the dataset page: https://huggingface.co/datasets/llm-blender/mix-instruct.

  7. AutoMultiTurnByCalm3-22B

    • huggingface.co
    Cite
    kan hatakeyama, AutoMultiTurnByCalm3-22B [Dataset]. https://huggingface.co/datasets/kanhatakeyama/AutoMultiTurnByCalm3-22B
    Authors
    kan hatakeyama
    License

    https://choosealicense.com/licenses/other/

    Description

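    Automatically generated multi-turn dataset

    Q&A pairs generated automatically from open data sources using Calm3-22b.

    Some of the computation was performed on the TSUBAME4.0 supercomputer at the Tokyo Institute of Technology.

    Data sources

    The initial questions (q1) were collected from various data sources. All subsequent turns were generated by Calm. The question texts are subject to the licenses of the original data.

    oasst2-33k-ja: Apache 2.0

    databricks-dolly-15k-ja: CC BY-SA 3.0

    minnade: CC0

    cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental: CC BY 4.0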

  8. SurgeGlobal/LaMini Dataset

    • paperswithcode.com
    Updated Apr 17, 2024
    Cite
    Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake (2024). SurgeGlobal/LaMini Dataset [Dataset]. https://paperswithcode.com/dataset/surgeglobal-lamini
    Dataset updated
    Apr 17, 2024
    Authors
    Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake
    Description

    Overview

    The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.

    Dataset Generation

    - Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
    - Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset
    - Generation Approach: Example-guided and topic-guided strategies
    - Total Instructions: 1,504 unique instruction examples

    Dataset Sources

    - Repository: Bitbucket Project
    - Paper: Pre-Print

    Structure

    Each entry in the dataset contains:
    - Instruction
    - Response

    Usage

    The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.

    Access

    The dataset is available on HuggingFace at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini

    Citation

    If you find our work useful, please cite our paper as follows:

    @misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

    Dataset Authors

    Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake

  9. tulu-v1-sft-mixture

    • huggingface.co
    Updated Nov 14, 2023
    Cite
    Ai2 (2023). tulu-v1-sft-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Tulu Instruction Mix

    For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:

    - FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
    - Open Assistant 1 (Apache 2.0)
    - Dolly (CC BY-SA 3.0)
    - ShareGPT (Apache 2.0 listed, no official repo found)
    - GPT4-Alpaca (CC BY-NC 4.0)
    - Code-Alpaca (CC BY-NC 4.0)

    These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.


databricks-dolly-15k

databricks/databricks-dolly-15k

160 scholarly articles cite this dataset (Google Scholar).