MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine-tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
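A minimal loading sketch (the train_sft/test_sft split names come from the entries below; the field access is based on the card's prompt/messages columns):

from datasets import load_dataset

# Load the supervised fine-tuning split; the card also provides a test_sft split.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
print(ds[0]["prompt"])        # the opening user prompt
print(ds[0]["messages"][:2])  # the first two chat turns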
Na0s/sft-ready-HuggingFaceH4-ultrachat-200k dataset hosted on Hugging Face and contributed by the HF Datasets community
manishiitg/HuggingFaceH4-ultrachat_200k dataset hosted on Hugging Face and contributed by the HF Datasets community
lilac/UltraChat-200k
This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
To download the dataset to a local directory:
lilac download lilacai/lilac-UltraChat-200k
or from Python with:
import lilac as ll  # the "ll" alias is assumed from Lilac's docs
ll.download("lilacai/lilac-UltraChat-200k")
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
HuggingFaceH4/ultrachat_200k in ChatML format, ready to use in Hugging Face TRL's SFTTrainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def format(columns):
    return {"text": tokenizer.apply_chat_template(columns["messages"], tokenize=False)}

… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-ultrachat_200k.
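The truncated snippet presumably finishes by mapping this function over the dataset; a minimal sketch of that final step (the map call is an assumption, not the verbatim original):

dataset = dataset.map(format)
print(dataset[0]["text"])  # ChatML-formatted conversation string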
A small set of 2048 samples from HuggingFaceH4/ultrachat_200k for easy calibration.
Reproduction code
from datasets import load_dataset
from huggingface_hub import HfApi

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
SAMPLE_SIZE = 2048
NEW_DATASET_ID = "neuralmagic/ultrachat_2k"

sampled_ds = load_dataset(DATASET_ID, split=DATASET_SPLIT).shuffle(seed=42).select(range(SAMPLE_SIZE))
… See the full description on the dataset page: https://huggingface.co/datasets/neuralmagic/ultrachat_2k.
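The truncated remainder presumably uploads the sample to the Hub (hence the HfApi import); a minimal sketch using the standard datasets API instead:

# assumed upload step; the original code may use HfApi directly
sampled_ds.push_to_hub(NEW_DATASET_ID)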
Origin Datasets: HuggingFaceH4/ultrachat_200k
Dataset Sampling for Merge-Up SLM Training
To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:
Filtering for English Only: we used a regular expression to filter the dataset, retaining only the samples written exclusively in English characters.
Proportional Sampling by Token Length: starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrachat_200k_filtered_10k_sampled.
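A minimal sketch of the two steps described above (the exact regex, tokenizer, and bucketing details are assumptions; the card does not publish them here):

import re
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Step 1: English-only filter (the character class is an assumption).
english_re = re.compile(r"^[A-Za-z0-9\s.,;:'\"!?()-]*$")
ds = ds.filter(lambda x: bool(english_re.match(x["prompt"])))

# Step 2: assign each sample to a 200-token-wide bucket starting at
# 4,000 tokens, so buckets can be sampled proportionally afterwards
# (the tokenizer choice is an assumption).
tok = AutoTokenizer.from_pretrained("gpt2")

def add_bucket(example):
    n_tokens = len(tok(example["prompt"])["input_ids"])
    return {"bucket": max(0, n_tokens - 4000) // 200}

ds = ds.map(add_bucket)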
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
UltraChat-200k ShareGPT Clean
This dataset was cleaned and created with 01_convert_ultrachat_200k_train_sft.ipynb and 02_convert_ultrachat_200k_test_sft.ipynb, based on HuggingFaceH4/ultrachat_200k (train_sft and test_sft). Main changes:
convert to the conversations format supported by Axolotl (see ShareGPT); a conversion sketch follows this list
clean invisible characters and strip whitespace (see mltb2.text.clean_all_invisible_chars_and_strip())
remove rows with empty text
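A minimal sketch of the conversion step (the role mapping follows the common ShareGPT convention and is an assumption here, not the notebooks' verbatim code):

from mltb2.text import clean_all_invisible_chars_and_strip

ROLE_MAP = {"user": "human", "assistant": "gpt"}  # common ShareGPT naming

def to_sharegpt(example):
    conversations = []
    for msg in example["messages"]:
        text = clean_all_invisible_chars_and_strip(msg["content"])
        conversations.append({"from": ROLE_MAP[msg["role"]], "value": text})
    return {"conversations": conversations}

# rows whose cleaned text ends up empty are dropped in a separate pass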
Licensing
Copyright (c) 2024 Philip… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/UltraChat-200k-ShareGPT-clean.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
German UltraChat
This dataset contains the first 1k prompts from HuggingFaceH4/ultrachat_200k, translated to German, with responses generated by GPT-4.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
[1k, 5k, 50k] random short prompts from HuggingFaceH4/ultrachat_200k.
How it was created
import numpy as np
from datasets import load_dataset

np.random.seed(42)

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.filter(lambda x: len(x["prompt"]) <= 1024)
print(f"Number of short samples: {len(dataset)}")

for subset in ["1", "5", "50"]:
    # np.random.choice samples with replacement by default
    dataset_subset = dataset.select(np.random.choice(len(dataset), int(subset) * 1000))
… See the full description on the dataset page: https://huggingface.co/datasets/vinczematyas/ultrachat_subsets.
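The truncated loop body presumably writes each subset out; a minimal sketch (the push_to_hub call and config naming are assumptions, not the original code):

# inside the loop above, e.g.:
dataset_subset.push_to_hub("vinczematyas/ultrachat_subsets", config_name=f"{subset}k")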
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "ultrachat_10k_nl"
A translated version of 10k randomly selected examples from HuggingFaceH4/ultrachat_200k. Automatically translated by GPT-3.5.
More info
Read more about GEITje-chat, the datasets and the translation code in the 📄 README on GitHub.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k Korean
Dataset Description
🎉 Translation finished! If there are any errors, please open a PR. 🎉 This is a Korean-translated version of the HuggingFaceH4/ultrachat_200k train_sft split, which is a heavily filtered version of the UltraChat dataset. I used solar-1-mini-translate-enko-240507. For the detailed script on how I did it, please refer to the GitHub repo: link. The total cost was about $1,300.
Data Fields
prompt_id :… See the full description on the dataset page: https://huggingface.co/datasets/ChuGyouk/HFH4_ultrachat_200k_ko.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🎨 Open-PerfectBlend
Open-PerfectBlend is an open-source reproduction of the instruction dataset introduced in the paper "The Perfect Blend: Redefining RLHF with Mixture of Judges". It's a solid general-purpose instruction dataset with chat, math, code, and instruction-following data.
Data source
Here is the list of the datasets used in this mix:
Dataset (number of samples):
meta-math/MetaMathQA: 395,000
openbmb/UltraInteract_sft: 288,579
HuggingFaceH4/ultrachat_200k: … See the full description on the dataset page: https://huggingface.co/datasets/mlabonne/open-perfectblend.
HuggingFaceFW/fineweb-edu (20%, common knowledge)
devngho/the-stack-llm-annotations-v2 (25%, code)
AI-MO/NuminaMath-1.5 (20%, math)
HuggingFaceH4/ultrachat_200k (20%, chat)
HuggingFaceFW/fineweb-2 (15%, multilingual: cmn_Hani, deu_Latn, jpn_Jpan, spa_Latn, fra_Latn, ita_Latn, por_Latn, nld_Latn, arb_Arab)
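A minimal sketch of such a weighted mixture using datasets.interleave_datasets, shown for two of the components (the text flattening and the renormalized probabilities are illustrative assumptions, not the recipe's actual code):

from datasets import load_dataset, interleave_datasets

# Stream two of the components to avoid a full download.
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
fineweb = fineweb.select_columns(["text"])

ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
# Flatten each chat into plain text so both sources share one schema.
ultrachat = ultrachat.map(
    lambda ex: {"text": "\n".join(m["content"] for m in ex["messages"])},
    remove_columns=["prompt", "prompt_id", "messages"],
)

# 20% / 20% in the full blend -> 0.5 / 0.5 over this two-way subset.
mix = interleave_datasets([fineweb, ultrachat], probabilities=[0.5, 0.5], seed=42)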
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ultrachat_200k_nl
Dataset Description
This dataset is a translation of HuggingFaceH4/ultrachat_200k using a MarianMT model. It contains multi-turn chat conversations between a user and an assistant.
Dataset Structure
The dataset has two splits; only the SFT splits of the original dataset were translated.
train: 207,858 examples
test: 23,106 examples
Usage
from datasets import load_dataset
ds = load_dataset("ReBatch/ultrachat_200k_nl")
… See the full description on the dataset page: https://huggingface.co/datasets/ReBatch/ultrachat_200k_nl.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
BB-Ultrachat-IndicLingual6-12k
This dataset was created by bhaiyabot ai to enrich language model training data, especially in the context of Indic languages. The code used to create it is also open source at https://github.com/ro-hansolo/IndicTrans2HuggingFaceDatasets
Overview
BB-Ultrachat-IndicLingual6-12k is a curated dataset comprising 12,000 multi-turn conversations, which are a subset of the larger HuggingFaceH4/ultrachat_200k dataset. These conversations have been evenly… See the full description on the dataset page: https://huggingface.co/datasets/rohansolo/BB-Ultrachat-IndicLingual6-12k.
To train a Korean chatbot, several datasets were collected and unified into a single format:
heegyu/glaive-function-calling-v2-ko: 15,170 items
FreedomIntelligence/evol-instruct-korean: 59,022 items
heegyu/PKU-SafeRLHF-ko: 135,213 items
maywell/koVast: 684,579 items
MarkrAI/KoCommercial-Dataset: 175,454 items
HuggingFaceH4/ultrachat_200k: 207,865 items
Open-Orca/SlimOrca-Dedup: 363,491 items
glaiveai/glaive-code-assistant-v2: 215,166 items
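A minimal sketch of the format unification for two of these sources (the SlimOrca field names follow its ShareGPT-style schema and are an assumption here):

from datasets import load_dataset, concatenate_datasets

ultra = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ultra = ultra.select_columns(["messages"])

orca = load_dataset("Open-Orca/SlimOrca-Dedup", split="train")

# Convert ShareGPT-style {"from", "value"} turns to {"content", "role"}.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def convert(example):
    return {"messages": [{"content": m["value"], "role": ROLE_MAP[m["from"]]}
                         for m in example["conversations"]]}

orca = orca.map(convert, remove_columns=orca.column_names)
orca = orca.cast(ultra.features)  # align schemas before concatenation

merged = concatenate_datasets([ultra, orca])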
Dataset Card for ultrachat_400k_nl
Dataset Description
This dataset is a combination of 2 datasets for the Dutch language. The first is a translation of HuggingFaceH4/ultrachat_200k using a MarianMT model; it contains multi-turn chat conversations between a user and an assistant. The second is BramVanroy/ultrachat_200k_dutch, a recreation of ultrachat_200k in Dutch with GPT-4.
Dataset Structure
The dataset has two splits; only the SFT splits of the… See the full description on the dataset page: https://huggingface.co/datasets/ReBatch/ultrachat_400k_nl.