MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine-tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
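A minimal loading sketch (the train_sft/test_sft split names come from the entries below; the field access is based on the card's prompt/messages columns):

from datasets import load_dataset

# Load the supervised fine-tuning split; the card also provides a test_sft split.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
print(ds[0]["prompt"])        # the opening user prompt
print(ds[0]["messages"][:2])  # the first two chat turns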
Na0s/sft-ready-HuggingFaceH4-ultrachat-200k dataset hosted on Hugging Face and contributed by the HF Datasets community
manishiitg/HuggingFaceH4-ultrachat_200k dataset hosted on Hugging Face and contributed by the HF Datasets community
lilac/UltraChat-200k
This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
To download the dataset to a local directory:
lilac download lilacai/lilac-UltraChat-200k
or from Python with:
import lilac as ll  # the "ll" alias is assumed from Lilac's docs
ll.download("lilacai/lilac-UltraChat-200k")
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
HuggingFaceH4/ultrachat_200k in ChatML format, ready to use in Hugging Face TRL's SFTTrainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def format(columns):
    return {"text": tokenizer.apply_chat_template(columns["messages"], tokenize=False)}

… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-ultrachat_200k.
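The truncated snippet presumably finishes by mapping this function over the dataset; a minimal sketch of that final step (the map call is an assumption, not the verbatim original):

dataset = dataset.map(format)
print(dataset[0]["text"])  # ChatML-formatted conversation string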
A small set of 2048 samples from HuggingFaceH4/ultrachat_200k for easy calibration.
Reproduction code
from datasets import load_dataset
from huggingface_hub import HfApi

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
SAMPLE_SIZE = 2048
NEW_DATASET_ID = "neuralmagic/ultrachat_2k"

sampled_ds = load_dataset(DATASET_ID, split=DATASET_SPLIT).shuffle(seed=42).select(range(SAMPLE_SIZE))
… See the full description on the dataset page: https://huggingface.co/datasets/neuralmagic/ultrachat_2k.
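The truncated remainder presumably uploads the sample to the Hub (hence the HfApi import); a minimal sketch using the standard datasets API instead:

# assumed upload step; the original code may use HfApi directly
sampled_ds.push_to_hub(NEW_DATASET_ID)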
Origin Datasets: HuggingFaceH4/ultrachat_200k
Dataset Sampling for Merge-Up SLM Training
To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:
Filtering for English Only: we used a regular expression to filter the dataset, retaining only the samples written exclusively in English characters.
Proportional Sampling by Token Length: starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrachat_200k_filtered_10k_sampled.
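A minimal sketch of the two steps described above (the exact regex, tokenizer, and bucketing details are assumptions; the card does not publish them here):

import re
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Step 1: English-only filter (the character class is an assumption).
english_re = re.compile(r"^[A-Za-z0-9\s.,;:'\"!?()-]*$")
ds = ds.filter(lambda x: bool(english_re.match(x["prompt"])))

# Step 2: assign each sample to a 200-token-wide bucket starting at
# 4,000 tokens, so buckets can be sampled proportionally afterwards
# (the tokenizer choice is an assumption).
tok = AutoTokenizer.from_pretrained("gpt2")

def add_bucket(example):
    n_tokens = len(tok(example["prompt"])["input_ids"])
    return {"bucket": max(0, n_tokens - 4000) // 200}

ds = ds.map(add_bucket)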
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
UltraChat-200k ShareGPT Clean
This dataset was cleaned and created with 01_convert_ultrachat_200k_train_sft.ipynb and 02_convert_ultrachat_200k_test_sft.ipynb, based on HuggingFaceH4/ultrachat_200k (train_sft and test_sft). Main changes:
convert to the conversations format supported by Axolotl (see ShareGPT); a conversion sketch follows this list
clean invisible characters and strip whitespace (see mltb2.text.clean_all_invisible_chars_and_strip())
remove rows with empty text
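A minimal sketch of the conversion step (the role mapping follows the common ShareGPT convention and is an assumption here, not the notebooks' verbatim code):

from mltb2.text import clean_all_invisible_chars_and_strip

ROLE_MAP = {"user": "human", "assistant": "gpt"}  # common ShareGPT naming

def to_sharegpt(example):
    conversations = []
    for msg in example["messages"]:
        text = clean_all_invisible_chars_and_strip(msg["content"])
        conversations.append({"from": ROLE_MAP[msg["role"]], "value": text})
    return {"conversations": conversations}

# rows whose cleaned text ends up empty are dropped in a separate pass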
Licensing
Copyright (c) 2024 Philip… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/UltraChat-200k-ShareGPT-clean.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
German UltraChat
This dataset contains the first 1k prompts from HuggingFaceH4/ultrachat_200k, translated to German, with responses generated by GPT-4.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
[1k, 5k, 50k] random short prompts from HuggingFaceH4/ultrachat_200k.
How it was created
import numpy as np
from datasets import load_dataset

np.random.seed(42)

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.filter(lambda x: len(x["prompt"]) <= 1024)
print(f"Number of short samples: {len(dataset)}")

for subset in ["1", "5", "50"]:
    # np.random.choice samples with replacement by default
    dataset_subset = dataset.select(np.random.choice(len(dataset), int(subset) * 1000))
… See the full description on the dataset page: https://huggingface.co/datasets/vinczematyas/ultrachat_subsets.
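The truncated loop body presumably writes each subset out; a minimal sketch (the push_to_hub call and config naming are assumptions, not the original code):

# inside the loop above, e.g.:
dataset_subset.push_to_hub("vinczematyas/ultrachat_subsets", config_name=f"{subset}k")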
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "ultrachat_10k_nl"
A translated version of 10k randomly selected examples from HuggingFaceH4/ultrachat_200k. Automatically translated by GPT-3.5.
More info
Read more about GEITje-chat, the datasets and the translation code in the 📄 README on GitHub.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k Korean
Dataset Description
🎉 Translation finished! If there are any errors, please open a PR. 🎉 This is a Korean-translated version of the HuggingFaceH4/ultrachat_200k train_sft split, which is a heavily filtered version of the UltraChat dataset. I used solar-1-mini-translate-enko-240507. For the detailed script on how I did it, please refer to the GitHub repo: link. The total cost was about $1,300.
Data Fields
prompt_id :… See the full description on the dataset page: https://huggingface.co/datasets/ChuGyouk/HFH4_ultrachat_200k_ko.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🎨 Open-PerfectBlend
Open-PerfectBlend is an open-source reproduction of the instruction dataset introduced in the paper "The Perfect Blend: Redefining RLHF with Mixture of Judges". It's a solid general-purpose instruction dataset with chat, math, code, and instruction-following data.
Data source
Here is the list of the datasets used in this mix:
Dataset (number of samples):
meta-math/MetaMathQA: 395,000
openbmb/UltraInteract_sft: 288,579
HuggingFaceH4/ultrachat_200k: … See the full description on the dataset page: https://huggingface.co/datasets/mlabonne/open-perfectblend.
HuggingFaceFW/fineweb-edu (20%, common knowledge)
devngho/the-stack-llm-annotations-v2 (25%, code)
AI-MO/NuminaMath-1.5 (20%, math)
HuggingFaceH4/ultrachat_200k (20%, chat)
HuggingFaceFW/fineweb-2 (15%, multilingual: cmn_Hani, deu_Latn, jpn_Jpan, spa_Latn, fra_Latn, ita_Latn, por_Latn, nld_Latn, arb_Arab)
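A minimal sketch of such a weighted mixture using datasets.interleave_datasets, shown for two of the components (the text flattening and the renormalized probabilities are illustrative assumptions, not the recipe's actual code):

from datasets import load_dataset, interleave_datasets

# Stream two of the components to avoid a full download.
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
fineweb = fineweb.select_columns(["text"])

ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
# Flatten each chat into plain text so both sources share one schema.
ultrachat = ultrachat.map(
    lambda ex: {"text": "\n".join(m["content"] for m in ex["messages"])},
    remove_columns=["prompt", "prompt_id", "messages"],
)

# 20% / 20% in the full blend -> 0.5 / 0.5 over this two-way subset.
mix = interleave_datasets([fineweb, ultrachat], probabilities=[0.5, 0.5], seed=42)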
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ultrachat_200k_nl
Dataset Description
This dataset is a translation of HuggingFaceH4/ultrachat_200k using a MarianMT model. It contains multi-turn chat conversations between a user and an assistant.
Dataset Structure
The dataset has two splits; only the SFT splits of the original dataset were translated.
train: 207,858 examples
test: 23,106 examples
Usage
from datasets import load_dataset
ds = load_dataset("ReBatch/ultrachat_200k_nl")
… See the full description on the dataset page: https://huggingface.co/datasets/ReBatch/ultrachat_200k_nl.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
BB-Ultrachat-IndicLingual6-12k
This dataset was created by bhaiyabot ai to enrich language model training data, especially in the context of Indic languages. The code used to create it is also open source at https://github.com/ro-hansolo/IndicTrans2HuggingFaceDatasets
Overview
BB-Ultrachat-IndicLingual6-12k is a curated dataset comprising 12,000 multi-turn conversations, which are a subset of the larger HuggingFaceH4/ultrachat_200k dataset. These conversations have been evenly… See the full description on the dataset page: https://huggingface.co/datasets/rohansolo/BB-Ultrachat-IndicLingual6-12k.
To train a Korean chatbot, several datasets were collected and unified into a single format:
heegyu/glaive-function-calling-v2-ko: 15,170 items
FreedomIntelligence/evol-instruct-korean: 59,022 items
heegyu/PKU-SafeRLHF-ko: 135,213 items
maywell/koVast: 684,579 items
MarkrAI/KoCommercial-Dataset: 175,454 items
HuggingFaceH4/ultrachat_200k: 207,865 items
Open-Orca/SlimOrca-Dedup: 363,491 items
glaiveai/glaive-code-assistant-v2: 215,166 items
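A minimal sketch of the format unification for two of these sources (the SlimOrca field names follow its ShareGPT-style schema and are an assumption here):

from datasets import load_dataset, concatenate_datasets

ultra = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ultra = ultra.select_columns(["messages"])

orca = load_dataset("Open-Orca/SlimOrca-Dedup", split="train")

# Convert ShareGPT-style {"from", "value"} turns to {"content", "role"}.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def convert(example):
    return {"messages": [{"content": m["value"], "role": ROLE_MAP[m["from"]]}
                         for m in example["conversations"]]}

orca = orca.map(convert, remove_columns=orca.column_names)
orca = orca.cast(ultra.features)  # align schemas before concatenation

merged = concatenate_datasets([ultra, orca])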
Dataset Card for ultrachat_400k_nl
Dataset Description
This dataset is a combination of 2 datasets for the Dutch language. The first is a translation of HuggingFaceH4/ultrachat_200k using a MarianMT model; it contains multi-turn chat conversations between a user and an assistant. The second is BramVanroy/ultrachat_200k_dutch, a recreation of ultrachat_200k in Dutch with GPT-4.
Dataset Structure
The dataset has two splits; only the SFT splits of the… See the full description on the dataset page: https://huggingface.co/datasets/ReBatch/ultrachat_400k_nl.