18 datasets found
  1. ultrachat_200k

    • huggingface.co
    • opendatalab.com
    Updated Oct 29, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for UltraChat 200k

      Dataset Description
    

    This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model. The original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

    Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.

  2. h

    sft-ready-HuggingFaceH4-ultrachat-200k

    • huggingface.co
    Updated Aug 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ali Janati (2024). sft-ready-HuggingFaceH4-ultrachat-200k [Dataset]. https://huggingface.co/datasets/Na0s/sft-ready-HuggingFaceH4-ultrachat-200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 21, 2024
    Authors
    Ali Janati
    Description

    Na0s/sft-ready-HuggingFaceH4-ultrachat-200k dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. h

    HuggingFaceH4-ultrachat_200k

    • huggingface.co
    Updated Mar 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Manish Prakash (2024). HuggingFaceH4-ultrachat_200k [Dataset]. https://huggingface.co/datasets/manishiitg/HuggingFaceH4-ultrachat_200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 17, 2024
    Authors
    Manish Prakash
    Description

    manishiitg/HuggingFaceH4-ultrachat_200k dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. h

    lilac-UltraChat-200k

    • huggingface.co
    Updated Feb 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lilac AI (2024). lilac-UltraChat-200k [Dataset]. https://huggingface.co/datasets/lilacai/lilac-UltraChat-200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 8, 2024
    Dataset authored and provided by
    Lilac AI
    Description

    lilac/UltraChat-200k

    This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k To download the dataset to a local directory: lilac download lilacai/lilac-UltraChat-200k

    or from python with: ll.download("lilacai/lilac-UltraChat-200k")

  5. h

    ChatML-ultrachat_200k

    • huggingface.co
    Updated Feb 20, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Victor Nogueira (2024). ChatML-ultrachat_200k [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-ultrachat_200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 20, 2024
    Authors
    Victor Nogueira
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    HuggingFaceH4/ultrachat_200k in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion: from datasets import load_dataset from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

    def format(columns): return { "text": tokenizer.apply_chat_template(columns["messages"], tokenize=False) }… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-ultrachat_200k.

  6. h

    ultrachat_2k

    • huggingface.co
    Updated Jul 2, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neural Magic (2024). ultrachat_2k [Dataset]. https://huggingface.co/datasets/neuralmagic/ultrachat_2k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 2, 2024
    Dataset authored and provided by
    Neural Magic
    Description

    A small set of 2048 samples from HuggingFaceH4/ultrachat_200k for easy calibration.

      Reproduction code
    

    from datasets import load_dataset from huggingface_hub import HfApi

    Constants

    DATASET_ID = "HuggingFaceH4/ultrachat_200k" DATASET_SPLIT = "train_sft" SAMPLE_SIZE = 2048 NEW_DATASET_ID = "neuralmagic/ultrachat_2k"

    Load, sample, and save dataset

    sampled_ds = load_dataset(DATASET_ID, split=DATASET_SPLIT).shuffle(seed=42).select(range(SAMPLE_SIZE))… See the full description on the dataset page: https://huggingface.co/datasets/neuralmagic/ultrachat_2k.

  7. h

    HuggingFaceH4_ultrachat_200k_filtered_10k_sampled

    • huggingface.co
    Updated Apr 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jungki son (2025). HuggingFaceH4_ultrachat_200k_filtered_10k_sampled [Dataset]. https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrachat_200k_filtered_10k_sampled
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    jungki son
    Description

    Origin Datasets: HuggingFaceH4/ultrachat_200k Dataset Sampling for Merge-Up SLM Training To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:

    Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively. Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrachat_200k_filtered_10k_sampled.

  8. h

    UltraChat-200k-ShareGPT-clean

    • huggingface.co
    Updated Apr 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Philip May (2024). UltraChat-200k-ShareGPT-clean [Dataset]. https://huggingface.co/datasets/PhilipMay/UltraChat-200k-ShareGPT-clean
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 6, 2024
    Authors
    Philip May
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    UltraChat-200k ShareGPT Clean

    This dataset is cleaned and created with 01_convert_ultrachat_200k_train_sft.ipynb and 02_convert_ultrachat_200k_test_sft.ipynb based on HuggingFaceH4/ultrachat_200k (train_sft and test_sft). Main changes:

    convert to conversations format which is supported by Axolotl - see ShareGPT clean invisible characters and strip - see mltb2.text.clean_all_invisible_chars_and_strip() remove rows with empty text

      Licensing
    

    Copyright (c) 2024 Philip… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/UltraChat-200k-ShareGPT-clean.

  9. h

    ultrachat_de

    • huggingface.co
    Updated Dec 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Björn Plüster (2023). ultrachat_de [Dataset]. https://huggingface.co/datasets/bjoernp/ultrachat_de
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 13, 2023
    Authors
    Björn Plüster
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    German UltraChat

    This dataset contains the first 1k prompts from HuggingFaceH4/ultrachat_200k translated to German and inference on with GPT-4.

  10. h

    ultrachat_subsets

    • huggingface.co
    Updated Apr 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mátyás Vincze (2025). ultrachat_subsets [Dataset]. https://huggingface.co/datasets/vinczematyas/ultrachat_subsets
    Explore at:
    Dataset updated
    Apr 13, 2025
    Authors
    Mátyás Vincze
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [1k, 5k, 50k] random short prompts from HuggingFaceH4/ultrachat_200k.

      How it was created
    

    import numpy as np from datasets import load_dataset

    np.random.seed(42)

    dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") dataset = dataset.filter(lambda x: len(x["prompt"]) <= 1024) print(f"Number of short samples: {len(dataset)}")

    for subset in ["1", "5", "50"]: dataset_subset = dataset.select(np.random.choice(len(dataset), int(subset) * 1000))… See the full description on the dataset page: https://huggingface.co/datasets/vinczematyas/ultrachat_subsets.

  11. h

    ultrachat_10k_nl

    • huggingface.co
    Updated Dec 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Edwin Rijgersberg (2023). ultrachat_10k_nl [Dataset]. https://huggingface.co/datasets/Rijgersberg/ultrachat_10k_nl
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2023
    Authors
    Edwin Rijgersberg
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for "ultrachat_10k_nl"

    A translated version of 10k randomly selected examples from HuggingFaceH4/ultrachat_200k. Automatically translated by GPT-3.5.

      More info
    

    Read more about GEITje-chat, the datasets and the translation code in the 📄 README on GitHub.

  12. h

    HFH4_ultrachat_200k_ko

    • huggingface.co
    Updated Sep 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ChuGyouk (2024). HFH4_ultrachat_200k_ko [Dataset]. https://huggingface.co/datasets/ChuGyouk/HFH4_ultrachat_200k_ko
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 8, 2024
    Authors
    ChuGyouk
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for UltraChat 200k Korean

      Dataset Description
    

    🎉 Translation finished! If there are any errors, please open the PR. 🎉 This is a Korean translated version of HuggingFaceH4/ultrachat_200k train-sft split, which is a heavily filtered version of the UltraChat dataset. I used solar-1-mini-translate-enko-240507. To see the detailed script on how I did it, please refer to the github repo: link The total cost was about 1300$.

      Data Fields
    

    prompt_id :… See the full description on the dataset page: https://huggingface.co/datasets/ChuGyouk/HFH4_ultrachat_200k_ko.

  13. h

    open-perfectblend

    • huggingface.co
    Updated Oct 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maxime Labonne (2024). open-perfectblend [Dataset]. https://huggingface.co/datasets/mlabonne/open-perfectblend
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 13, 2024
    Authors
    Maxime Labonne
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🎨 Open-PerfectBlend

    Open-PerfectBlend is an open-source reproduction of the instruction dataset introduced in the paper "The Perfect Blend: Redefining RLHF with Mixture of Judges". It's a solid general-purpose instruction dataset with chat, math, code, and instruction-following data.

      Data source
    

    Here is the list of the datasets used in this mix:

    Dataset

    Samples

    meta-math/MetaMathQA 395,000

    openbmb/UltraInteract_sft 288,579

    HuggingFaceH4/ultrachat_200k… See the full description on the dataset page: https://huggingface.co/datasets/mlabonne/open-perfectblend.

  14. h

    zip2zip-1B-no-split

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    EPFL Data Science Lab, zip2zip-1B-no-split [Dataset]. https://huggingface.co/datasets/epfl-dlab/zip2zip-1B-no-split
    Explore at:
    Dataset authored and provided by
    EPFL Data Science Lab
    Description

    HuggingFaceFW/fineweb-edu (20%) (common knowledge) devngho/the-stack-llm-annotations-v2 (25%) (code) AI-MO/NuminaMath-1.5 (20%) (math) HuggingFaceH4/ultrachat_200k (20%) (chat) HuggingFaceFW/fineweb-2 (15%) (multilingual: [cmn_Hani, deu_Latn, jpn_Jpan, spa_Latn, fra_Latn, ita_Latn, por_Latn, nld_Latn, arb_Arab])

  15. h

    ultrachat_200k_nl

    • huggingface.co
    Updated Jun 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ReBatch (2024). ultrachat_200k_nl [Dataset]. https://huggingface.co/datasets/ReBatch/ultrachat_200k_nl
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    ReBatch
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for ultrachat_400k_nl

      Dataset Description
    

    This dataset is a tranlsation of HuggingFaceH4/ultrachat_200K using a MarianMT model. It contains multi-turn chat conversations between a user and an assistant.

      Dataset Structure
    

    The dataset has two splits; Only the SFT splits of the original dataset were translated.

    train test

    207858 23106

      Usage
    

    from datasets import load_dataset

    ds = load_dataset("ReBatch/ultrachat_200k_nl")… See the full description on the dataset page: https://huggingface.co/datasets/ReBatch/ultrachat_200k_nl.

  16. h

    BB-Ultrachat-IndicLingual6-12k

    • huggingface.co
    Updated Feb 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rohan (2024). BB-Ultrachat-IndicLingual6-12k [Dataset]. https://huggingface.co/datasets/rohansolo/BB-Ultrachat-IndicLingual6-12k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 7, 2024
    Authors
    Rohan
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    BB-Ultrachat-IndicLingual6-12k

    This dataset is created by bhaiyabot ai to enrich language model training data, especially in the context of Indic languages. code for creation is also open source at https://github.com/ro-hansolo/IndicTrans2HuggingFaceDatasets

      Overview
    

    BB-Ultrachat-IndicLingual6-12k is a curated dataset comprising 12,000 multi-turn conversations, which are a subset of the larger HuggingFaceH4/ultrachat_200k dataset. These conversations have been evenly… See the full description on the dataset page: https://huggingface.co/datasets/rohansolo/BB-Ultrachat-IndicLingual6-12k.

  17. h

    ko-openchat-0404

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Heegyu Kim, ko-openchat-0404 [Dataset]. https://huggingface.co/datasets/heegyu/ko-openchat-0404
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Heegyu Kim
    Description

    한국어 챗봇 학습을 위해, 여러 데이터를 가져와서 포멧을 통일

    heegyu/glaive-function-calling-v2-ko: 15170 items FreedomIntelligence/evol-instruct-korean: 59022 items heegyu/PKU-SafeRLHF-ko: 135213 items maywell/koVast: 684579 items MarkrAI/KoCommercial-Dataset: 175454 items HuggingFaceH4/ultrachat_200k: 207865 items Open-Orca/SlimOrca-Dedup: 363491 items glaiveai/glaive-code-assistant-v2: 215166 items

  18. h

    ultrachat_400k_nl

    • huggingface.co
    Updated Jun 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ReBatch (2024). ultrachat_400k_nl [Dataset]. https://huggingface.co/datasets/ReBatch/ultrachat_400k_nl
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 7, 2024
    Dataset authored and provided by
    ReBatch
    Description

    Dataset Card for ultrachat_400k_nl

      Dataset Description
    

    This dataset is a combination 2 datasets for the Dutch Language. The first is a tranlsation of HuggingFaceH4/ultrachat_200K using a MarianMT model. It contains multi-turn chat conversations between a user and an assistant. The second is BramVanroy/ultrachat_200k_dutch. This is a recreation of ultrachat_200K in Dutch with gpt-4.

      Dataset Structure
    

    The dataset has two splits; Only the SFT splits of the… See the full description on the dataset page: https://huggingface.co/datasets/ReBatch/ultrachat_400k_nl.

  19. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
Organization logo

ultrachat_200k

UltraChat 200k

HuggingFaceH4/ultrachat_200k

Explore at:
33 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 29, 2023
Dataset provided by
Hugging Facehttps://huggingface.co/
Authors
Hugging Face H4
License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Dataset Card for UltraChat 200k

  Dataset Description

This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model. The original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.

Search
Clear search
Close search
Google apps
Main menu