MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MInference/mt-bench dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT Bench by LMSYS
This set of evaluation prompts was created by the LMSYS org to better evaluate chat models. For more information, see the paper.
Dataset loading
To load this dataset, use 🤗 datasets:

from datasets import load_dataset
data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
Dataset creation
To create the dataset, we do the following in our internal tooling:
rename "turns" to "prompts", add an empty "reference" to the remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SEA-MTBench
SEA-MTBench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use gpt-4-1106-preview as the judge model and compare against gpt-3.5-turbo-0125 as the baseline model. It is based on MT-Bench and was manually translated by native speakers for Indonesian (id), Javanese (jv), Sundanese (su), and Vietnamese (vi). The Thai split of this dataset uses MT-Bench Thai from the ThaiLLM leaderboard.… See the full description on the dataset page: https://huggingface.co/datasets/aisingapore/MultiTurn-Chat-MT-Bench.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Content
This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-Bench questions. The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic area of each question. The details of data collection can be found in our paper.
Agreement Calculation
This Colab notebook shows how to compute the… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/mt_bench_human_judgments.
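The linked notebook has the full procedure; as a rough illustration, agreement between two annotators can be computed as the fraction of shared (question, model pair) votes with the same winner. The records below are invented for illustration and are not drawn from the dataset:

```python
def pairwise_agreement(votes_a, votes_b):
    """Fraction of shared (question_id, model_a, model_b) keys with the same winner."""
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return 0.0
    matches = sum(votes_a[k] == votes_b[k] for k in shared)
    return matches / len(shared)

# Hypothetical votes keyed by (question_id, model_a, model_b); winners are invented.
annotator_1 = {
    (81, "gpt-4", "vicuna-13b"): "model_a",
    (82, "gpt-4", "alpaca-13b"): "model_a",
    (83, "gpt-3.5", "llama-13b"): "model_b",
    (84, "claude-v1", "vicuna-13b"): "tie",
}
annotator_2 = {
    (81, "gpt-4", "vicuna-13b"): "model_a",
    (82, "gpt-4", "alpaca-13b"): "model_b",
    (83, "gpt-3.5", "llama-13b"): "model_b",
    (84, "claude-v1", "vicuna-13b"): "tie",
}
print(pairwise_agreement(annotator_1, annotator_2))  # 0.75
```

The actual notebook also handles ties and per-category breakdowns; this sketch only shows the core ratio.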
MT-Bench-Select Dataset
Introduction
The MT-Bench-Select dataset is a refined subset of the original MT-Bench dataset introduced by Zheng et al. (2023). The original MT-Bench dataset comprises 80 questions with answers generated by six models. Each question and each pair of models form an evaluation task, resulting in 1,200 tasks. For this dataset, we used a curated subset of the original MT-Bench dataset, as prepared by the authors of the LLMBar paper (Zeng et al.… See the full description on the dataset page: https://huggingface.co/datasets/bay-calibration-llm-evaluators/mtbench-annotated-latest.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
💬 MT-Bench-ZH
👻 GitHub
🎯 Motivation
MiniChat-1/1.5/2-3B are all instruction-following language models that can handle Chinese instructions; however, there is currently no instruction-following benchmark specialized for Chinese. Because of this, our previous evaluations were limited to English-only benchmarks (i.e., AlpacaEval and MT-Bench). MT-Bench-ZH was created to fill this gap. MT-Bench-ZH is translated from MT-Bench by GPT-4 and further… See the full description on the dataset page: https://huggingface.co/datasets/GeneZC/MT-Bench-ZH.
MTBench-German
This dataset provides a German version of MTBench. We provide patches on top of VAGOsolutions/MT-Bench-TrueGerman, correcting minor errors in some of the user prompts (for example, asking the model for gender-neutral pronouns, which do not exist in the German language) and adding compatibility with the current version of the MTBench GitHub repo. To use this dataset, run the patch_script.sh script. This will download the datasets from VAGOsolutions/MT-Bench-TrueGerman… See the full description on the dataset page: https://huggingface.co/datasets/Aleph-Alpha/MTBench-German.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT-Bench Thai
MT-Bench Thai is a dataset for multi-turn benchmarking that covers nine categories:
Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, Social Science, and Knowledge III
We introduce the final category, Knowledge III, which evaluates understanding of Thai cultural context.
Dataset Loading
from datasets import load_dataset
ds = load_dataset("ThaiLLM-Leaderboard/mt-bench-thai")
print(ds)
Output:
DatasetDict({ train: Dataset({ features: ['question_id'… See the full description on the dataset page: https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai.
Long-MT-Bench+
Long-MT-Bench+ is reconstructed from MT-Bench+ [1] and is more challenging for long-term conversations. [1] Junru Lu et al. MemoChat: Tuning LLMs to use memos for consistent long-range open-domain conversation. 2023.
Dataset Description
Building on MT-Bench+, we use the human-written questions in MT-Bench+ as few-shot examples and ask GPT-4 to generate a long-range test question for each dialogue. Following [2], we merge five consecutive sessions… See the full description on the dataset page: https://huggingface.co/datasets/panzs19/Long-MT-Bench-Plus.
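The session-merging step described above can be sketched as grouping every five consecutive sessions into one long dialogue. The session contents below are placeholders; only the group size of five follows the description:

```python
def merge_sessions(sessions, group_size=5):
    """Concatenate every `group_size` consecutive sessions into one long dialogue."""
    merged = []
    for start in range(0, len(sessions), group_size):
        group = sessions[start:start + group_size]
        # Each session is a list of (role, text) turns; chain them in order.
        merged.append([turn for session in group for turn in session])
    return merged

# Ten placeholder sessions of two turns each -> two merged long dialogues.
sessions = [[("user", f"q{i}"), ("assistant", f"a{i}")] for i in range(10)]
long_dialogues = merge_sessions(sessions)
print(len(long_dialogues), len(long_dialogues[0]))  # 2 10
```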
QingyuShi/MTBench dataset hosted on Hugging Face and contributed by the HF Datasets community
Benchmark
German Benchmarks on Hugging Face At present, there is a notable scarcity, if not a complete absence, of reliable and truly German benchmarks designed to evaluate the capabilities of German large language models (LLMs). While some efforts have been made to translate English benchmarks into German, these attempts often fall short in terms of precision, accuracy, and context sensitivity, even when employing GPT-4 technology. Take, for instance, MT-Bench, a widely recognized and… See the full description on the dataset page: https://huggingface.co/datasets/VAGOsolutions/MT-Bench-TrueGerman.
tiratano/MTBench-Human dataset hosted on Hugging Face and contributed by the HF Datasets community
tiratano/MTBench-Eval dataset hosted on Hugging Face and contributed by the HF Datasets community
SMUGRI-MT-Bench
This is a Finno-Ugric version of MT-Bench created to evaluate the multi-turn conversation and instruction-following capabilities of LLMs. It covers 4 (extremely) low-resource Finno-Ugric languages: Estonian, Livonian, Komi, and Võro. SMUGRI-MT-Bench comprises 80 single- and multi-turn questions organized into four topics: math, reasoning, writing, and general. The questions were handpicked from the LMSYS-Chat-1M dataset and manually translated into Estonian, Võro, Komi… See the full description on the dataset page: https://huggingface.co/datasets/tartuNLP/smugri-mt-bench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT-Bench Dataset
The MT-Bench dataset is a collection of challenging multi-turn, open-ended questions designed to evaluate chat assistants and language models. Using LLM-as-a-judge, this dataset leverages strong models like GPT-4 to assess response quality and provide automated grading. This README provides details on using and extending the dataset for evaluation purposes.
Introduction
There has been a proliferation of LLM-based chat assistants (chatbots) that leverage… See the full description on the dataset page: https://huggingface.co/datasets/ZoneTwelve/mt-bench-tw.
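As a minimal sketch of the LLM-as-a-judge pattern described above: a strong judge model (e.g. GPT-4) receives the question and two candidate answers, and its reply is parsed for a verdict. The prompt wording below and the [[A]]/[[B]]/[[C]] verdict markers follow the common MT-Bench pairwise convention, but this is an illustration, not the official evaluation harness:

```python
import re

def build_judge_prompt(question, answer_a, answer_b):
    """Assemble a pairwise comparison prompt for a strong judge model."""
    return (
        "Please act as an impartial judge and evaluate the two AI assistant "
        "responses to the user question below. Output your final verdict as "
        '"[[A]]" if assistant A is better, "[[B]]" if assistant B is better, '
        'or "[[C]]" for a tie.\n\n'
        f"[Question]\n{question}\n\n"
        f"[Assistant A]\n{answer_a}\n\n"
        f"[Assistant B]\n{answer_b}\n"
    )

def parse_verdict(judge_output):
    """Extract the [[A]]/[[B]]/[[C]] verdict from the judge model's reply."""
    match = re.search(r"\[\[([ABC])\]\]", judge_output)
    return match.group(1) if match else None

prompt = build_judge_prompt("What is 2+2?", "4", "5")
print(parse_verdict("Assistant A is correct. Verdict: [[A]]"))  # A
```

In practice the judge is called twice with the answer order swapped to control for position bias; the parser above is the piece that turns free-form judge text into a usable label.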
shisa-ai/ja-mt-bench-1shot dataset hosted on Hugging Face and contributed by the HF Datasets community
davidanugraha/MT-Bench dataset hosted on Hugging Face and contributed by the HF Datasets community
koreankiwi99/mtbench-dpo-turn1-human dataset hosted on Hugging Face and contributed by the HF Datasets community
Description
This dataset is used to check criticon prompts/responses while testing. It contains instructions/responses from mt_bench_eval, as extracted from: https://github.com/kaistAI/prometheus/blob/main/evaluation/benchmark/data/mt_bench_eval.json The dataset has been obtained by cleaning the data with:

import re
import pandas as pd
from datasets import Dataset

df = pd.read_json("mt_bench_eval.json", lines=True)
ds = Dataset.from_pandas(df, preserve_index=False)… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique.
PatronusAI/glider-mt-bench-suite dataset hosted on Hugging Face and contributed by the HF Datasets community