21 datasets found
  1. coqa-sharegpt-format

    • huggingface.co
    Updated Mar 4, 2025
    Cite
    BookingCare Technology .,JSC (2025). coqa-sharegpt-format [Dataset]. https://huggingface.co/datasets/BookingCare/coqa-sharegpt-format
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    https://bookingcare.vn/
    Authors
    BookingCare Technology .,JSC
    Description

    BookingCare/coqa-sharegpt-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. Code-74k-ShareGPT-Vicuna

    • huggingface.co
    Updated Jan 10, 2024
    Cite
    Cognitive Computations (2024). Code-74k-ShareGPT-Vicuna [Dataset]. https://huggingface.co/datasets/cognitivecomputations/Code-74k-ShareGPT-Vicuna
    Dataset updated
    Jan 10, 2024
    Dataset authored and provided by
    Cognitive Computations
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Code-74k-ShareGPT-Vicuna

    This dataset is in Vicuna/ShareGPT format. It contains around 74,000 sets of conversations, each set having 2 conversations. Code in Python, Java, JavaScript, Go, C++, Rust, etc. is provided with detailed explanations. Around 60-65% of the dataset is Python code.

  3. Atma3.2-ShareGPT

    • huggingface.co
    Updated Nov 16, 2024
    Cite
    RS (2024). Atma3.2-ShareGPT [Dataset]. https://huggingface.co/datasets/HappyAIUser/Atma3.2-ShareGPT
    Dataset updated
    Nov 16, 2024
    Authors
    RS
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Atma3.2-ShareGPT

    This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.

      Dataset Description
    

    The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:

    An instruction that specifies the task; an optional input providing context; a detailed output that addresses the instruction.
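
The conversion described above, from an instruction-input-output triple to a ShareGPT-style record, might look roughly like the sketch below. The function name `to_sharegpt`, the `human`/`gpt` role labels, and the choice to join the optional input to the instruction with a blank line are assumptions made for illustration; the dataset card does not state how the authors actually did the conversion.

```python
from typing import Optional


def to_sharegpt(instruction: str, output: str, input_text: Optional[str] = None) -> dict:
    """Convert one instruction-input-output triple into a ShareGPT-style record."""
    # Assumption: when an input is present it is appended to the instruction,
    # separated by a blank line. Other converters may use a different layout.
    prompt = instruction if not input_text else f"{instruction}\n\n{input_text}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},  # role names are assumed
            {"from": "gpt", "value": output},
        ]
    }


# Hypothetical example triple, not taken from the dataset.
record = to_sharegpt(
    instruction="Summarize the text.",
    input_text="ShareGPT stores chats as a list of turns.",
    output="Chats are stored as turn lists.",
)
```

Each triple becomes a two-turn conversation, so the optional input ends up inside the human turn rather than in a separate field.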

      Usage… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/Atma3.2-ShareGPT.
    
  4. shisa-v2-sharegpt

    • huggingface.co
    Updated Apr 15, 2025
    Cite
    Shisa.AI (2025). shisa-v2-sharegpt [Dataset]. https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Shisa.AI
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    shisa-v2-sharegpt

    This is an updated version of the original shisa-v1 dataset augmxnt/ultra-orca-boros-en-ja-v1 and retains the same conversations field and sharegpt formatting to facilitate its use as a drop-in replacement for the original dataset. The shisa-v2 revision filters a few entries, but largely retains the exact composition and prompts of the original.

    All responses have been entirely regenerated from open weight models (Athene V2, Llama 3.3 70B, and Tulu 3 405B) Outputs… See the full description on the dataset page: https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt.

  5. Nectar-ShareGPT-clean

    • huggingface.co
    Cite
    Philip May, Nectar-ShareGPT-clean [Dataset]. https://huggingface.co/datasets/PhilipMay/Nectar-ShareGPT-clean
    Authors
    Philip May
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Nectar ShareGPT Clean

    This dataset is cleaned and created with 04_convert_nectar.ipynb based on berkeley-nest/Nectar. Main changes:

    Convert to the conversations format supported by Axolotl (see ShareGPT); use only the best-ranked answers; clean invisible characters and strip (see mltb2.text.clean_all_invisible_chars_and_strip()); remove rows with empty text; remove rows from multiple sources (see the source column).
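
A rough sketch of the cleaning steps listed above follows. It is not the author's 04_convert_nectar.ipynb and does not call mltb2. The `clean_text` helper is a stand-in that only drops Unicode format characters (category Cf). The row layout (`prompt`, `answers` with `answer` and `rank`, and a list-valued `source`), the `human`/`gpt` role labels, and the reading of rank 1 as best are all assumptions made for illustration.

```python
import unicodedata
from typing import Optional


def clean_text(text: str) -> str:
    """Stand-in cleaner: drop Unicode format characters (category Cf) and strip."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf").strip()


def convert_row(row: dict) -> Optional[dict]:
    """Return a ShareGPT-style record, or None if the row should be dropped."""
    # Assumed layout: row["source"] is a list of source names.
    if len(row.get("source", [])) > 1:
        return None
    if not row.get("answers"):
        return None
    # Assumed: a lower rank value means a better answer (rank 1 is best).
    best = min(row["answers"], key=lambda a: a["rank"])
    prompt = clean_text(row["prompt"])
    answer = clean_text(best["answer"])
    if not prompt or not answer:
        return None  # remove rows with empty text
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ]
    }


# Hypothetical row for illustration only.
example_row = {
    "prompt": "\u200bWhat is 2+2? ",
    "answers": [{"answer": "5", "rank": 2.0}, {"answer": " 4\u200b", "rank": 1.0}],
    "source": ["example-source"],
}
converted = convert_row(example_row)
```

Returning None for dropped rows keeps the filtering decision in one place, so a caller can simply skip None results.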

      Licensing
    

    Copyright (c) 2024 Philip May. Copyright (c) Banghua Zhu… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/Nectar-ShareGPT-clean.

  6. ultrachat_200k_sharegpt

    • huggingface.co
    Updated Feb 5, 2024
    Cite
    Abhinand Balachandran (2024). ultrachat_200k_sharegpt [Dataset]. https://huggingface.co/datasets/abhinand/ultrachat_200k_sharegpt
    Dataset updated
    Feb 5, 2024
    Authors
    Abhinand Balachandran
    Description

    Dataset Card for UltraChat 200k

    This is just the original ultrachat 200k dataset converted to sharegpt format.

      Dataset Description
    

    This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

    Selection of a subset of data for faster… See the full description on the dataset page: https://huggingface.co/datasets/abhinand/ultrachat_200k_sharegpt.

  7. cosmopedia-japanese-subset_from_aixsatoshi_filtered-sharegpt-format-no-system-prompt_split_5...

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    shido_wake (2024). cosmopedia-japanese-subset_from_aixsatoshi_filtered-sharegpt-format-no-system-prompt_split_5 [Dataset]. https://huggingface.co/datasets/shidowake/cosmopedia-japanese-subset_from_aixsatoshi_filtered-sharegpt-format-no-system-prompt_split_5
    Dataset updated
    Jun 21, 2024
    Authors
    shido_wake
    Description

    shidowake/cosmopedia-japanese-subset_from_aixsatoshi_filtered-sharegpt-format-no-system-prompt_split_5 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. SCP_40k-claude-3-7-sonnet-16k-sharegpt

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Marcus Cedric R. Idia (2025). SCP_40k-claude-3-7-sonnet-16k-sharegpt [Dataset]. https://huggingface.co/datasets/marcuscedricridia/SCP_40k-claude-3-7-sonnet-16k-sharegpt
    Dataset updated
    Apr 3, 2025
    Authors
    Marcus Cedric R. Idia
    Description

    Merged UI Dataset: SCP_40k-claude-3-7-sonnet-16k-sharegpt

    This dataset was automatically generated by merging and processing the following sources: mlfoundations-dev/SCP_40k-claude-3-7-sonnet-16k
    Generation Timestamp: 2025-04-03 17:50:36
    Processing Time: 14.17 seconds
    Output Format: sharegpt

      Processing Summary
    

    Total Datasets Attempted: 1
    Datasets Successfully Processed: 1
    Datasets Failed/Skipped: 0
    Total Input Rows Scanned: 49,603
    Total Formatted Entries Generated: 49… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/SCP_40k-claude-3-7-sonnet-16k-sharegpt.

  9. ATCgpt-Fixed

    • huggingface.co
    Updated Dec 8, 2024
    Cite
    RS (2024). ATCgpt-Fixed [Dataset]. https://huggingface.co/datasets/HappyAIUser/ATCgpt-Fixed
    Dataset updated
    Dec 8, 2024
    Authors
    RS
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for ATCgpt-Fixed

    This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.

      Dataset Description
    

    The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:

    An instruction that specifies the task; an optional input providing context; a detailed output that addresses the instruction.

      Usage
    

    This… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/ATCgpt-Fixed.

  10. MMLU-Alpaca

    • huggingface.co
    Updated Dec 15, 2024
    Cite
    RS (2024). MMLU-Alpaca [Dataset]. https://huggingface.co/datasets/HappyAIUser/MMLU-Alpaca
    Dataset updated
    Dec 15, 2024
    Authors
    RS
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for MMLU-Alpaca

    This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.

      Dataset Description
    

    The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:

    An instruction that specifies the task; an optional input providing context; a detailed output that addresses the instruction.

      Usage
    

    This… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/MMLU-Alpaca.

  11. AEZAKMI_v2_sharegpt

    • huggingface.co
    Updated Jan 30, 2024
    Cite
    Adam (2024). AEZAKMI_v2_sharegpt [Dataset]. https://huggingface.co/datasets/adamo1139/AEZAKMI_v2_sharegpt
    Dataset updated
    Jan 30, 2024
    Authors
    Adam
    License

    https://choosealicense.com/licenses/other/

    Description

    I moved AEZAKMI V2 in sharegpt format to a different repo so that it's easier to use with HF datasets library.

  12. newnewdataset-sophie

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    Moritz Nickel (2024). newnewdataset-sophie [Dataset]. https://huggingface.co/datasets/Fischerboot/newnewdataset-sophie
    Dataset updated
    Jun 22, 2024
    Authors
    Moritz Nickel
    License

    https://choosealicense.com/licenses/unknown/

    Description

    new version with more output examples, in sharegpt format

  13. Viet-ShareGPT-4o-Text-VQA

    • huggingface.co
    Updated Jan 27, 2025
    Cite
    Fifth Civil Defender - 5CD (2025). Viet-ShareGPT-4o-Text-VQA [Dataset]. https://huggingface.co/datasets/5CD-AI/Viet-ShareGPT-4o-Text-VQA
    Dataset updated
    Jan 27, 2025
    Dataset authored and provided by
    Fifth Civil Defender - 5CD
    Description

    Dataset Overview

    This dataset was created from 42,678 Vietnamese 🇻🇳 images with the latest GPT-4o. The dataset has superior quality compared to other existing datasets with:

    Highly detailed descriptions, from the overall composition of the image to descriptions of each object, including their location, quantity, etc. Descriptions of text include not only recognition but also the font style, color, position, and size of the text. Answers are very long and detailed, including… See the full description on the dataset page: https://huggingface.co/datasets/5CD-AI/Viet-ShareGPT-4o-Text-VQA.

  14. Contexual-RAG-Relations-Dataset

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    ZySec AI (2025). Contexual-RAG-Relations-Dataset [Dataset]. https://huggingface.co/datasets/ZySec-AI/Contexual-RAG-Relations-Dataset
    Dataset updated
    Mar 23, 2025
    Dataset authored and provided by
    ZySec AI
    Description

    Crawlify Pronoun Replacement Dataset

    This dataset contains conversation pairs for training a model to replace pronouns with full names and relevant details.

      Format
    

    Each example in the dataset follows the ShareGPT format: { "conversations": [ { "from": "system", "value": "system message" }, { "from": "human", "value": "input text" }, { "from": "assistant"… See the full description on the dataset page: https://huggingface.co/datasets/ZySec-AI/Contexual-RAG-Relations-Dataset.
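
To make the record layout above concrete, here is a minimal sketch that parses one ShareGPT-style example and checks its shape. The description is truncated after the `assistant` turn, so the `"output text"` value below is a hypothetical placeholder, and the set of accepted role names (including `gpt`, which some other ShareGPT datasets use) is an assumption rather than something the dataset card specifies.

```python
import json
from typing import List

# Assumed role names; the card above shows "system", "human", and "assistant".
ALLOWED_ROLES = {"system", "human", "assistant", "gpt"}


def validate_sharegpt(record: dict) -> List[str]:
    """Return a list of problems found in a ShareGPT-style record (empty if none)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["'conversations' must be a non-empty list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        role = turn.get("from")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {role!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' must be a string")
    return problems


# The assistant value is a placeholder; the source text is cut off at that point.
example = json.loads("""
{
  "conversations": [
    {"from": "system", "value": "system message"},
    {"from": "human", "value": "input text"},
    {"from": "assistant", "value": "output text"}
  ]
}
""")
issues = validate_sharegpt(example)
```

A check like this is mainly useful before training, since mixed role labels across merged datasets are a common source of silent formatting errors.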

  15. Literotica-RP-Conversion-test-1

    • huggingface.co
    Updated May 28, 2025
    Cite
    minipasila (2025). Literotica-RP-Conversion-test-1 [Dataset]. https://huggingface.co/datasets/mpasila/Literotica-RP-Conversion-test-1
    Dataset updated
    May 28, 2025
    Authors
    minipasila
    Description

    Uses ShareGPT. This is just a quick test, I was gonna do more but Grok 3 is not that cheap.. and scaling it is gonna cost. But it seems to at least know what I wanted it to do. (Other models had annoying issues.) System prompt for the generation of this data: You're a bot that transforms stories into human/gpt roled conversations in ShareGPT formatting in .json meaning new lines use and so on. You're supposed to transform the story into a roleplay conversation between an user(human) and the… See the full description on the dataset page: https://huggingface.co/datasets/mpasila/Literotica-RP-Conversion-test-1.

  16. CoTton-6k

    • huggingface.co
    Updated Jun 6, 2025
    Cite
    Newstar Research ASIA (2025). CoTton-6k [Dataset]. https://huggingface.co/datasets/NewstaR/CoTton-6k
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Newstar Research ASIA
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for CoTton-6k

      🧠 Dataset Summary
    

    CoTton-6k is a 6,000-example dataset of soft reasoning conversations in the ShareGPT format. Each entry contains an exchange between a user and a model, showcasing high-quality Chain-of-Thought (CoT) reasoning in natural language. The dataset is distilled from 3 cutting-edge open LLMs:

    Qwen3, AM Thinking, QwQ

    The name CoTton encodes multiple layers of meaning:

    CoT — Chain-of-Thought is embedded in the name. TON — The dataset… See the full description on the dataset page: https://huggingface.co/datasets/NewstaR/CoTton-6k.

  17. Reddit-Writing-SGPT

    • huggingface.co
    Updated Dec 3, 2024
    Cite
    Bintang Fortuna (2024). Reddit-Writing-SGPT [Dataset]. https://huggingface.co/datasets/BintangFortuna/Reddit-Writing-SGPT
    Dataset updated
    Dec 3, 2024
    Dataset authored and provided by
    Bintang Fortuna
    Description

    I forgot if this dataset is the dirty version of Reddit Writing Prompts or not, it's probably a mix of both. The data was filtered and classified using Lilac with two embedding models:

    jinaai/jina-embeddings-v2-base-en, BAAI/bge-m3

    (Note: Lilac is amazing BTW, and the UI is nice. Highly recommended for data processing tasks) The dataset has been converted to ShareGPT format, including word counts for responses and labeled perspectives. While the labeling may not be 100% accurate, ambiguous… See the full description on the dataset page: https://huggingface.co/datasets/BintangFortuna/Reddit-Writing-SGPT.

  18. ichikara-instruction-003-sharegpt

    • huggingface.co
    Updated Dec 21, 2024
    Cite
    DataPilot (2024). ichikara-instruction-003-sharegpt [Dataset]. https://huggingface.co/datasets/DataPilot/ichikara-instruction-003-sharegpt
    Dataset updated
    Dec 21, 2024
    Dataset authored and provided by
    DataPilot
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    ichikara-instruction-003-sharegpt Dataset by DataPilot

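      Dataset Summary
    

    This dataset is the Japanese instruction data published at kinokokoro/ichikara-instruction-003, converted to the widely used ShareGPT format. The conversion and publication were done by DataPilot. The original dataset contains human-written answers to a variety of questions and is useful for fine-tuning Japanese large language models (LLMs). This ShareGPT-format version is particularly suited to training models that expect conversation-style input data. Note: the original dataset may contain multiple answers to a single question. In this ShareGPT-format dataset, each question-answer pair is treated as one independent conversation.

      Data Format
    

    The data is JSON… See the full description on the dataset page: https://huggingface.co/datasets/DataPilot/ichikara-instruction-003-sharegpt.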
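
The note in this description, that a question with several answers is split so each question-answer pair becomes its own conversation, could be sketched as follows. The helper name `expand_pairs`, the `human`/`gpt` role labels, and the sample strings are assumptions for illustration; the actual field names and conversion script of this dataset are not shown in the description.

```python
from typing import Dict, List


def expand_pairs(question: str, answers: List[str]) -> List[Dict]:
    """Emit one ShareGPT-style conversation per (question, answer) pair."""
    return [
        {
            "conversations": [
                {"from": "human", "value": question},  # role names are assumed
                {"from": "gpt", "value": answer},
            ]
        }
        for answer in answers
    ]


# Hypothetical question with two human-written answers.
conversations = expand_pairs(
    "What is the capital of Japan?",
    ["Tokyo.", "The capital of Japan is Tokyo."],
)
```

One consequence of this layout is that the same question text appears in several records, which matters if duplicates are later removed by prompt.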

  19. German-RAG-ORPO-ShareGPT-HESSIAN-AI

    • huggingface.co
    Updated Dec 4, 2024
    Cite
    Avemio AG (2024). German-RAG-ORPO-ShareGPT-HESSIAN-AI [Dataset]. https://huggingface.co/datasets/avemio/German-RAG-ORPO-ShareGPT-HESSIAN-AI
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    Avemio AG
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    German-RAG-ORPO (Odds Ratio Preference Optimization) ShareGPT-Format

      German-RAG - German Retrieval Augmented Generation
    
    
    
    
    
      Dataset Summary
    

    The ORPO Tasks Dataset represents a specialized collection for fine-tuning language models with a focus on RAG-specific capabilities. The subsets for this training step are derived from 3 different sources:

    SauerkrautLM Preference Datasets: SauerkrautLM-Fermented-GER-DPO: is a specialized dataset designed for training… See the full description on the dataset page: https://huggingface.co/datasets/avemio/German-RAG-ORPO-ShareGPT-HESSIAN-AI.

  20. function-calling_chatml_gemma_v1

    • huggingface.co
    Updated Apr 17, 2024
    Cite
    Nicky (2024). function-calling_chatml_gemma_v1 [Dataset]. https://huggingface.co/datasets/NickyNicky/function-calling_chatml_gemma_v1
    Dataset updated
    Apr 17, 2024
    Authors
    Nicky
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    take dataset.

    hiyouga/glaive-function-calling-v2-sharegpt

      image tokens (Min: 60 Max: 2099).

      format gemma template.
