BookingCare/coqa-sharegpt-format dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Code-74k-ShareGPT-Vicuna This dataset is in Vicuna/ShareGPT format. There are around 74,000 conversation sets, each containing 2 conversations. Python, Java, JavaScript, Go, C++, Rust, etc. code with detailed explanations is provided. Around 60-65% of the dataset is Python code.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Atma3.2-ShareGPT
This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.
Dataset Description
The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:
- An instruction that specifies the task
- An optional input providing context
- A detailed output that addresses the instruction
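The instruction-input-output to ShareGPT conversion described above can be sketched as follows. This is a minimal illustration, not the card's actual conversion script; the field names (`instruction`, `input`, `output`) follow the common Alpaca convention, and the `human`/`gpt` role names follow the usual ShareGPT convention.

```python
def alpaca_to_sharegpt(record):
    """Convert one instruction-input-output record into a ShareGPT entry."""
    # Fold the optional input into the human turn as extra context.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```

Applying this per-row (e.g. with `datasets.Dataset.map`) yields the two-turn conversations that ShareGPT-style trainers expect.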
Usage… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/Atma3.2-ShareGPT.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
shisa-v2-sharegpt
This is an updated version of the original shisa-v1 dataset augmxnt/ultra-orca-boros-en-ja-v1. It retains the same conversations field and ShareGPT formatting to facilitate its use as a drop-in replacement for the original dataset. The shisa-v2 revision filters out a few entries but largely retains the exact composition and prompts of the original.
All responses have been entirely regenerated from open-weight models (Athene V2, Llama 3.3 70B, and Tulu 3 405B). Outputs… See the full description on the dataset page: https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Nectar ShareGPT Clean
This dataset is cleaned and created with 04_convert_nectar.ipynb based on berkeley-nest/Nectar. Main changes:
- convert to the conversations format supported by Axolotl (see ShareGPT)
- use only the best-rank answers
- clean invisible characters and strip (see mltb2.text.clean_all_invisible_chars_and_strip())
- remove rows with empty text
- remove rows from multiple sources (see the source column)
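The cleaning steps above can be sketched roughly as below. This is an illustrative reconstruction, not the card's 04_convert_nectar.ipynb: the row schema (`prompt`, rank-ordered `answers`) is assumed from Nectar's layout, and a stdlib stand-in replaces mltb2's `clean_all_invisible_chars_and_strip()` by stripping Unicode format-category (Cf) characters.

```python
import unicodedata

def clean_invisible(text):
    """Remove invisible (Unicode category Cf) characters and strip whitespace."""
    cleaned = "".join(c for c in text if unicodedata.category(c) != "Cf")
    return cleaned.strip()

def nectar_row_to_sharegpt(row):
    """Keep only the best-ranked answer; drop rows with empty text."""
    answers = row.get("answers") or []
    if not answers:
        return None  # nothing to keep
    best = min(answers, key=lambda a: a["rank"])  # rank 1 is the best answer
    prompt = clean_invisible(row["prompt"])
    answer = clean_invisible(best["answer"])
    if not prompt or not answer:
        return None  # drop rows with empty text
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ]
    }
```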
Licensing
Copyright (c) 2024 Philip May. Copyright (c) Banghua Zhu… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/Nectar-ShareGPT-clean.
Dataset Card for UltraChat 200k
This is just the original UltraChat 200k dataset converted to ShareGPT format.
Dataset Description
This is a heavily filtered version of the UltraChat dataset that was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster… See the full description on the dataset page: https://huggingface.co/datasets/abhinand/ultrachat_200k_sharegpt.
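The conversion to ShareGPT format mentioned in this card amounts to a role-name mapping. The sketch below assumes the common chat layout of a `messages` list with `role`/`content` keys (as used by ultrachat_200k) and the usual ShareGPT role names; it is an illustration, not the card author's script.

```python
# Conventional mapping from OpenAI-style roles to ShareGPT roles.
ROLE_MAP = {"system": "system", "user": "human", "assistant": "gpt"}

def messages_to_sharegpt(messages):
    """Map a role/content message list onto ShareGPT's from/value turns."""
    return {
        "conversations": [
            {"from": ROLE_MAP[m["role"]], "value": m["content"]}
            for m in messages
        ]
    }
```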
shidowake/cosmopedia-japanese-subset_from_aixsatoshi_filtered-sharegpt-format-no-system-prompt_split_5 dataset hosted on Hugging Face and contributed by the HF Datasets community
Merged UI Dataset: SCP_40k-claude-3-7-sonnet-16k-sharegpt
This dataset was automatically generated by merging and processing the following sources: mlfoundations-dev/SCP_40k-claude-3-7-sonnet-16k
Generation Timestamp: 2025-04-03 17:50:36
Processing Time: 14.17 seconds
Output Format: sharegpt
Processing Summary
Total Datasets Attempted: 1
Datasets Successfully Processed: 1
Datasets Failed/Skipped: 0
Total Input Rows Scanned: 49,603
Total Formatted Entries Generated: 49… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/SCP_40k-claude-3-7-sonnet-16k-sharegpt.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ATCgpt-Fixed
This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.
Dataset Description
The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:
- An instruction that specifies the task
- An optional input providing context
- A detailed output that addresses the instruction
Usage
This… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/ATCgpt-Fixed.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for MMLU-Alpaca
This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.
Dataset Description
The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:
- An instruction that specifies the task
- An optional input providing context
- A detailed output that addresses the instruction
Usage
This… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/MMLU-Alpaca.
https://choosealicense.com/licenses/other/
I moved AEZAKMI V2 in ShareGPT format to a different repo so that it's easier to use with the HF datasets library.
https://choosealicense.com/licenses/unknown/
A new version with more output examples, in ShareGPT format.
Dataset Overview
This dataset was created from 42,678 Vietnamese 🇻🇳 images with the latest GPT-4o. The dataset has superior quality compared to other existing datasets, with:
- Highly detailed descriptions, from the overall composition of the image to descriptions of each object, including their location, quantity, etc.
- Descriptions of text that include not only recognition but also the font style, color, position, and size of the text.
- Answers that are very long and detailed, including… See the full description on the dataset page: https://huggingface.co/datasets/5CD-AI/Viet-ShareGPT-4o-Text-VQA.
Crawlify Pronoun Replacement Dataset
This dataset contains conversation pairs for training a model to replace pronouns with full names and relevant details.
Format
Each example in the dataset follows the ShareGPT format:
{
  "conversations": [
    { "from": "system", "value": "system message" },
    { "from": "human", "value": "input text" },
    { "from": "assistant"… See the full description on the dataset page: https://huggingface.co/datasets/ZySec-AI/Contexual-RAG-Relations-Dataset.
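A small validator for entries shaped like the format above can be useful before training. The sketch below is an assumption about common ShareGPT conventions (allowed role names, human-first ordering), not a formal specification of this dataset.

```python
def is_valid_sharegpt(entry):
    """Check that an entry looks like a well-formed ShareGPT conversation."""
    allowed = {"system", "human", "gpt", "assistant"}
    convs = entry.get("conversations")
    if not isinstance(convs, list) or not convs:
        return False
    for turn in convs:
        if turn.get("from") not in allowed or not isinstance(turn.get("value"), str):
            return False
    # Conventionally, the first non-system turn comes from the human side.
    turns = [t for t in convs if t["from"] != "system"]
    return bool(turns) and turns[0]["from"] == "human"
```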
Uses ShareGPT. This is just a quick test; I was gonna do more, but Grok 3 is not that cheap, and scaling it is gonna cost. But it seems to at least know what I wanted it to do. (Other models had annoying issues.) System prompt for the generation of this data: You're a bot that transforms stories into human/gpt-roled conversations in ShareGPT formatting in .json, meaning new lines use \n and so on. You're supposed to transform the story into a roleplay conversation between a user (human) and the… See the full description on the dataset page: https://huggingface.co/datasets/mpasila/Literotica-RP-Conversion-test-1.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CoTton-6k
🧠 Dataset Summary
CoTton-6k is a 6,000-example dataset of soft reasoning conversations in the ShareGPT format. Each entry contains an exchange between a user and a model, showcasing high-quality Chain-of-Thought (CoT) reasoning in natural language. The dataset is distilled from 3 cutting-edge open LLMs:
- Qwen3
- AM Thinking
- QwQ
The name CoTton encodes multiple layers of meaning:
- CoT: Chain-of-Thought is embedded in the name.
- TON: The dataset… See the full description on the dataset page: https://huggingface.co/datasets/NewstaR/CoTton-6k.
I forgot whether this dataset is the dirty version of Reddit Writing Prompts or not; it's probably a mix of both. The data was filtered and classified using Lilac with two embedding models:
- jinaai/jina-embeddings-v2-base-en
- BAAI/bge-m3
(Note: Lilac is amazing, BTW, and the UI is nice. Highly recommended for data processing tasks.) The dataset has been converted to ShareGPT format, including word counts for responses and labeled perspectives. While the labeling may not be 100% accurate, ambiguous… See the full description on the dataset page: https://huggingface.co/datasets/BintangFortuna/Reddit-Writing-SGPT.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
ichikara-instruction-003-sharegpt Dataset by DataPilot
Dataset Summary
This dataset is a conversion of the Japanese instruction data published at kinokokoro/ichikara-instruction-003 into the widely used ShareGPT format. The conversion and release were carried out by DataPilot. The original dataset contains human-written answers to a variety of questions and is useful for fine-tuning Japanese large language models (LLMs). This ShareGPT-format version is especially well suited to training models that expect conversational data input. Note: the original dataset may contain multiple answers to a single question. In this ShareGPT-format dataset, each question-answer pair is treated as its own independent conversation.
Data Format
The data is in JSON… See the full description on the dataset page: https://huggingface.co/datasets/DataPilot/ichikara-instruction-003-sharegpt.
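The note above, that each question-answer pair becomes its own independent conversation, can be sketched as follows. The function name and argument layout are hypothetical; only the one-pair-per-conversation behavior is taken from the card.

```python
def expand_to_conversations(question, answers):
    """Turn one question with several answers into independent conversations."""
    return [
        {
            "conversations": [
                {"from": "human", "value": question},
                {"from": "gpt", "value": answer},
            ]
        }
        for answer in answers
    ]
```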
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
German-RAG-ORPO (Odds Ratio Preference Optimization) ShareGPT-Format
German-RAG - German Retrieval Augmented Generation
Dataset Summary
The ORPO Tasks Dataset represents a specialized collection for fine-tuning language models with a focus on RAG-specific capabilities. The subsets used for this training step are derived from 3 different sources:
SauerkrautLM Preference Datasets: SauerkrautLM-Fermented-GER-DPO is a specialized dataset designed for training… See the full description on the dataset page: https://huggingface.co/datasets/avemio/German-RAG-ORPO-ShareGPT-HESSIAN-AI.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset taken from hiyouga/glaive-function-calling-v2-sharegpt. Image tokens (min: 60, max: 2099). Formatted with the Gemma template.