MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MInference/mt-bench dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT Bench by LMSYS
This set of evaluation prompts was created by the LMSYS org to better evaluate chat models. For more information, see the paper.
Dataset loading
To load this dataset, use 🤗 datasets:

from datasets import load_dataset
data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
Dataset creation
To create the dataset, we do the following in our internal tooling:
rename "turns" to "prompts", add an empty "reference" to the remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SEA-MTBench
SEA-MTBench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use gpt-4-1106-preview as the judge model and compare against gpt-3.5-turbo-0125 as the baseline model. It is based on MT-Bench and was manually translated by native speakers for Indonesian (id), Javanese (jv), Sundanese (su), and Vietnamese (vi). The Thai split of this dataset uses MT-Bench Thai from the ThaiLLM leaderboard.… See the full description on the dataset page: https://huggingface.co/datasets/aisingapore/MultiTurn-Chat-MT-Bench.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Content
This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-Bench questions. The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic area of each question. The details of data collection can be found in our paper.
Agreement Calculation
This Colab notebook shows how to compute the… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/mt_bench_human_judgments.
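The linked notebook has the full procedure; as a rough illustration, agreement between two annotators can be computed as the fraction of shared (question, model pair) votes with the same winner. The records below are invented for illustration and are not drawn from the dataset:

```python
def pairwise_agreement(votes_a, votes_b):
    """Fraction of shared (question_id, model_a, model_b) keys with the same winner."""
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return 0.0
    matches = sum(votes_a[k] == votes_b[k] for k in shared)
    return matches / len(shared)

# Hypothetical votes keyed by (question_id, model_a, model_b); winners are invented.
annotator_1 = {
    (81, "gpt-4", "vicuna-13b"): "model_a",
    (82, "gpt-4", "alpaca-13b"): "model_a",
    (83, "gpt-3.5", "llama-13b"): "model_b",
    (84, "claude-v1", "vicuna-13b"): "tie",
}
annotator_2 = {
    (81, "gpt-4", "vicuna-13b"): "model_a",
    (82, "gpt-4", "alpaca-13b"): "model_b",
    (83, "gpt-3.5", "llama-13b"): "model_b",
    (84, "claude-v1", "vicuna-13b"): "tie",
}
print(pairwise_agreement(annotator_1, annotator_2))  # 0.75
```

The actual notebook also handles ties and per-category breakdowns; this sketch only shows the core ratio.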
MT-Bench-Select Dataset
Introduction
The MT-Bench-Select dataset is a refined subset of the original MT-Bench dataset introduced by Zheng et al. (2023). The original MT-Bench dataset comprises 80 questions with answers generated by six models. Each question and each pair of models form an evaluation task, resulting in 1,200 tasks. For this dataset, we used a curated subset of the original MT-Bench dataset, as prepared by the authors of the LLMBar paper (Zeng et al.… See the full description on the dataset page: https://huggingface.co/datasets/bay-calibration-llm-evaluators/mtbench-annotated-latest.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
💬 MT-Bench-ZH
👻 GitHub
🎯 Motivation
MiniChat-1/1.5/2-3B are all instruction-following language models that can handle Chinese instructions; however, there is currently no instruction-following benchmark specialized for Chinese. Because of this, our previous evaluations were limited to English-only benchmarks (i.e., AlpacaEval and MT-Bench). MT-Bench-ZH was created to fill this gap. MT-Bench-ZH is translated from MT-Bench by GPT-4 and further… See the full description on the dataset page: https://huggingface.co/datasets/GeneZC/MT-Bench-ZH.
MTBench-German
This dataset provides a German version of MTBench. We provide patches on top of VAGOsolutions/MT-Bench-TrueGerman, correcting minor errors in some of the user prompts (for example, asking the model for gender-neutral pronouns, which do not exist in the German language) and adding compatibility with the current version of the MTBench GitHub repo. To use this dataset, run the patch_script.sh script. This will download the datasets from VAGOsolutions/MT-Bench-TrueGerman… See the full description on the dataset page: https://huggingface.co/datasets/Aleph-Alpha/MTBench-German.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT-Bench Thai
MT-Bench Thai is a dataset for multi-turn benchmarking that covers nine categories:
Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, Social Science, and Knowledge III
We introduce the final category, Knowledge III, which evaluates understanding of Thai cultural context.
Dataset Loading
from datasets import load_dataset
ds = load_dataset("ThaiLLM-Leaderboard/mt-bench-thai")
print(ds)
Output:
DatasetDict({ train: Dataset({ features: ['question_id'… See the full description on the dataset page: https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai.
Long-MT-Bench+
Long-MT-Bench+ is reconstructed from MT-Bench+ [1] and is more challenging for long-term conversations. [1] Junru Lu et al. MemoChat: Tuning LLMs to use memos for consistent long-range open-domain conversation. 2023.
Dataset Description
Building on MT-Bench+, we use the human-written questions in MT-Bench+ as few-shot examples and ask GPT-4 to generate a long-range test question for each dialogue. Following [2], we merge five consecutive sessions… See the full description on the dataset page: https://huggingface.co/datasets/panzs19/Long-MT-Bench-Plus.
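The session-merging step described above can be sketched as grouping every five consecutive sessions into one long dialogue. The session contents below are placeholders; only the group size of five follows the description:

```python
def merge_sessions(sessions, group_size=5):
    """Concatenate every `group_size` consecutive sessions into one long dialogue."""
    merged = []
    for start in range(0, len(sessions), group_size):
        group = sessions[start:start + group_size]
        # Each session is a list of (role, text) turns; chain them in order.
        merged.append([turn for session in group for turn in session])
    return merged

# Ten placeholder sessions of two turns each -> two merged long dialogues.
sessions = [[("user", f"q{i}"), ("assistant", f"a{i}")] for i in range(10)]
long_dialogues = merge_sessions(sessions)
print(len(long_dialogues), len(long_dialogues[0]))  # 2 10
```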
QingyuShi/MTBench dataset hosted on Hugging Face and contributed by the HF Datasets community
Benchmark
German Benchmarks on Hugging Face At present, there is a notable scarcity, if not a complete absence, of reliable and truly German benchmarks designed to evaluate the capabilities of German large language models (LLMs). While some efforts have been made to translate English benchmarks into German, these attempts often fall short in terms of precision, accuracy, and context sensitivity, even when employing GPT-4 technology. Take, for instance, MT-Bench, a widely recognized and… See the full description on the dataset page: https://huggingface.co/datasets/VAGOsolutions/MT-Bench-TrueGerman.
tiratano/MTBench-Human dataset hosted on Hugging Face and contributed by the HF Datasets community
tiratano/MTBench-Eval dataset hosted on Hugging Face and contributed by the HF Datasets community
SMUGRI-MT-Bench
This is a Finno-Ugric version of MT-Bench created to evaluate the multi-turn conversation and instruction-following capabilities of LLMs. It covers 4 (extremely) low-resource Finno-Ugric languages: Estonian, Livonian, Komi, and Võro. SMUGRI-MT-Bench comprises 80 single- and multi-turn questions organized into four topics: math, reasoning, writing, and general. The questions were handpicked from the LMSYS-Chat-1M dataset and manually translated into Estonian, Võro, Komi… See the full description on the dataset page: https://huggingface.co/datasets/tartuNLP/smugri-mt-bench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MT-Bench Dataset
The MT-Bench dataset is a collection of challenging multi-turn, open-ended questions designed to evaluate chat assistants and language models. Using LLM-as-a-judge, this dataset leverages strong models like GPT-4 to assess response quality and provide automated grading. This README provides details on using and extending the dataset for evaluation purposes.
Introduction
There has been a proliferation of LLM-based chat assistants (chatbots) that leverage… See the full description on the dataset page: https://huggingface.co/datasets/ZoneTwelve/mt-bench-tw.
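As a minimal sketch of the LLM-as-a-judge pattern described above: a strong judge model (e.g. GPT-4) receives the question and two candidate answers, and its reply is parsed for a verdict. The prompt wording below and the [[A]]/[[B]]/[[C]] verdict markers follow the common MT-Bench pairwise convention, but this is an illustration, not the official evaluation harness:

```python
import re

def build_judge_prompt(question, answer_a, answer_b):
    """Assemble a pairwise comparison prompt for a strong judge model."""
    return (
        "Please act as an impartial judge and evaluate the two AI assistant "
        "responses to the user question below. Output your final verdict as "
        '"[[A]]" if assistant A is better, "[[B]]" if assistant B is better, '
        'or "[[C]]" for a tie.\n\n'
        f"[Question]\n{question}\n\n"
        f"[Assistant A]\n{answer_a}\n\n"
        f"[Assistant B]\n{answer_b}\n"
    )

def parse_verdict(judge_output):
    """Extract the [[A]]/[[B]]/[[C]] verdict from the judge model's reply."""
    match = re.search(r"\[\[([ABC])\]\]", judge_output)
    return match.group(1) if match else None

prompt = build_judge_prompt("What is 2+2?", "4", "5")
print(parse_verdict("Assistant A is correct. Verdict: [[A]]"))  # A
```

In practice the judge is called twice with the answer order swapped to control for position bias; the parser above is the piece that turns free-form judge text into a usable label.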
shisa-ai/ja-mt-bench-1shot dataset hosted on Hugging Face and contributed by the HF Datasets community
davidanugraha/MT-Bench dataset hosted on Hugging Face and contributed by the HF Datasets community
koreankiwi99/mtbench-dpo-turn1-human dataset hosted on Hugging Face and contributed by the HF Datasets community
Description
This dataset is used to check criticon prompts/responses while testing. It contains instructions/responses from mt_bench_eval, as extracted from: https://github.com/kaistAI/prometheus/blob/main/evaluation/benchmark/data/mt_bench_eval.json The dataset has been obtained by cleaning the data with:

import re
import pandas as pd
from datasets import Dataset

df = pd.read_json("mt_bench_eval.json", lines=True)
ds = Dataset.from_pandas(df, preserve_index=False)… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique.
PatronusAI/glider-mt-bench-suite dataset hosted on Hugging Face and contributed by the HF Datasets community