https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for ShareGPT90K
Dataset Summary
This dataset is a collection of approximately 90,000 conversations scraped via the ShareGPT API before it was shut down. These conversations include both user prompts and responses from OpenAI's ChatGPT. This repository now contains the new 90K-conversation version. The previous 52K version may be found in the old/ directory.
Supported Tasks and Leaderboards
text-generation
Languages
This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/RyokoAI/ShareGPT52K.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Further cleaning done. Please look through the dataset and ensure that I didn't miss anything.
Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c
Two choices:
Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json
Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
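As a rough illustration of how the two files differ, the sketch below drops every conversation in which a model reply contains the phrase "I'm sorry, but". It assumes the common ShareGPT layout (a `conversations` list of turns with `from` and `value` keys); the function name, the `id` field, and the sample records are hypothetical, and this is not the dataset author's actual cleaning script.

```python
REFUSAL_PHRASE = "I'm sorry, but"


def drop_refusals(records, phrase=REFUSAL_PHRASE):
    """Return only the records in which no 'gpt' turn contains the phrase."""
    kept = []
    for record in records:
        turns = record.get("conversations", [])
        has_refusal = any(
            turn.get("from") == "gpt" and phrase in turn.get("value", "")
            for turn in turns
        )
        if not has_refusal:
            kept.append(record)
    return kept


# Hypothetical sample data in the assumed ShareGPT layout.
sample = [
    {"id": "a", "conversations": [
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "2 + 2 equals 4."},
    ]},
    {"id": "b", "conversations": [
        {"from": "human", "value": "Tell me a secret."},
        {"from": "gpt", "value": "I'm sorry, but I can't help with that."},
    ]},
]

filtered = drop_refusals(sample)
print([record["id"] for record in filtered])
```

Whether the original filter removed whole conversations or only the matching turns is not stated in the card; dropping whole conversations is an assumption made here for simplicity.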
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ShareGPT unfiltered dataset in RedPajama-Chat format
This dataset was created by converting the alpaca-lora-formatted ShareGPT dataset to the format required by RedPajama-Chat. This script was used for the conversion: https://github.com/fredi-python/Alpaca2INCITE-Dataset-Converter/blob/main/convert.py
WARNING: Only the first human and gpt text of each conversation from the original dataset is included in the dataset.
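To make the warning concrete, here is a minimal sketch of what keeping only the first human and gpt text means. It assumes ShareGPT-style turns with `from` and `value` keys; the function name and the sample conversation are hypothetical, and the sketch does not reproduce the linked convert.py or the RedPajama-Chat output template.

```python
def first_exchange(turns):
    """Return (first human text, first gpt text after it), or None if absent."""
    human_text = None
    for turn in turns:
        if human_text is None:
            # Still looking for the opening human message.
            if turn.get("from") == "human":
                human_text = turn.get("value", "")
        elif turn.get("from") == "gpt":
            # First model reply after the opening human message.
            return human_text, turn.get("value", "")
    return None


# Hypothetical multi-turn conversation: everything after the first
# exchange is discarded.
conversation = [
    {"from": "human", "value": "Name a prime number."},
    {"from": "gpt", "value": "7 is a prime number."},
    {"from": "human", "value": "And another?"},
    {"from": "gpt", "value": "11 is also prime."},
]

pair = first_exchange(conversation)
print(pair)
```

A consequence of this design is that follow-up turns, which often carry corrections or clarifications, never reach the converted dataset.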
The format
{"text": "
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Simple "Reflection" method dataset inspired by mattshumer
This is the ShareGPT version. Find prompt and response pair dataset here
This dataset was synthetically generated using Glaive AI. The structure has been improved and more rows have been added.
BookingCare/coqa-sharegpt-format dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/agpl-3.0/
Dataset Card: PIPPA-ShareGPT
This is a conversion of PygmalionAI's PIPPA deduped dataset to ShareGPT format for finetuning with Axolotl. The reformatting was done with a TypeScript project called ShareGPT-Reformat.
Files and explanations
pippa_sharegpt_raw.jsonl: The raw deduped dataset file converted to ShareGPT. Roles will be defaulted to your finetuning software.
pippa_sharegpt.jsonl: A ShareGPT dataset with the roles as USER: and CHARACTER: for finetuning… See the full description on the dataset page: https://huggingface.co/datasets/kingbri/PIPPA-shareGPT.
mlabonne/WizardLM_evol_instruct_v2_196K-ShareGPT dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenGVLab/ShareGPT-4o dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
German ShareGPT data translated by gpt-3.5-turbo. The dataset is used in research related to MultilingualSIFT.
arcee-ai/reasoning-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "guanaco-sharegpt-style"
More Information needed
abhinand/alpaca-gpt4-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
jwjiangb/sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
UltraChat dataset in ShareGPT format
This is the full UltraChat dataset converted to ShareGPT format.
https://choosealicense.com/licenses/cc0-1.0/
ShareGPT-Processed
The RyokoAI/ShareGPT52K dataset, converted to Markdown and labeled with the language used.
Acknowledgements
vinta/pangu.js: To insert whitespace between CJK (Chinese, Japanese, Korean) and half-width characters (alphabetical letters, numerical digits and symbols).
matthewwithanm/python-markdownify: Provides a starting point to convert HTML to Markdown.
BYVoid/OpenCC: Conversions between Traditional Chinese and Simplified Chinese.
aboSamoor/polyglot… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/ShareGPT-Processed.
horus-ai-labs/mmlu-sharegpt-all dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ShareGPT-4o-Image
ShareGPT-4o-Image is a large-scale and high-quality image generation dataset, where all images are produced by GPT-4o's image generation capabilities. This dataset is designed to align open multimodal models with GPT-4o's strengths in visual content creation. It includes 45K text-to-image and 46K text-and-image-to-image samples, making it a useful resource for enhancing multimodal models in both image generation and editing tasks.
Dataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/ShareGPT-4o-Image.
HappyAIUser/alpaca-sharegpt-data dataset hosted on Hugging Face and contributed by the HF Datasets community
synthetic_text_to_sql
ShareGPT version of gretelai/synthetic_text_to_sql using the following code:
from datasets import load_dataset, DatasetDict
dataset = load_dataset('gretelai/synthetic_text_to_sql', split='all')
def format_sample(sample): conversations = [ { "from": "human", "value": f"{sample['sql_context']}
{sample['sql_prompt']}" }, { "from": "gpt", "value":… See the full description on the dataset page: https://huggingface.co/datasets/mlabonne/synthetic_text_to_sql-ShareGPT.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ShareGPT dataset for training OpenChat V3 series. See OpenChat repository for instructions. Contents:
sharegpt_clean.json: ShareGPT dataset in original format, converted to Markdown, and with model labels.
sharegpt_gpt4.json: All instances in sharegpt_clean.json with model == "Model: GPT-4".
*.parquet: Pre-tokenized dataset for training specified version of OpenChat.
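The relationship between sharegpt_clean.json and sharegpt_gpt4.json can be sketched as a simple filter on the model label. The `model` key and the "Model: GPT-4" value come from the description above; the assumption that the file holds a list of dictionaries, the function name, and the sample records (including the other label value) are hypothetical.

```python
def select_gpt4(records, label="Model: GPT-4"):
    """Keep only the records whose 'model' field equals the given label."""
    return [record for record in records if record.get("model") == label]


# Hypothetical in-memory stand-in for the parsed contents of sharegpt_clean.json.
clean_records = [
    {"id": "1", "model": "Model: GPT-4", "items": []},
    {"id": "2", "model": "Model: Default", "items": []},
    {"id": "3", "items": []},  # record without a model label
]

gpt4_records = select_gpt4(clean_records)
print([record["id"] for record in gpt4_records])
```

With the real file, the list would first be read with `json.load` from the standard library, since the card notes that the dataset is not currently compatible with the HF dataset loader.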
Note: The dataset is NOT currently compatible with HF dataset loader. Licensed under MIT.