43 datasets found
  1. mt-bench

    • huggingface.co
    Updated Aug 1, 2024
    + more versions
    Cite
    MInference (2024). mt-bench [Dataset]. https://huggingface.co/datasets/MInference/mt-bench
    Explore at: Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 1, 2024
    Dataset authored and provided by
    MInference
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    MInference/mt-bench dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. mt_bench_prompts

    • huggingface.co
    Updated Jul 3, 2023
    Cite
    Hugging Face H4 (2023). mt_bench_prompts [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts
    Explore at: Croissant
    Dataset updated
    Jul 3, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MT Bench by LMSYS

    This set of evaluation prompts is created by the LMSYS org for better evaluation of chat models. For more information, see the paper.

      Dataset loading
    

    To load this dataset, use 🤗 datasets:

    from datasets import load_dataset
    data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")

      Dataset creation
    

    To create the dataset, we do the following for our internal tooling.

    rename turns to prompts, add empty reference to remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
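
    The two steps named above (renaming turns to prompts and adding an empty reference where one is missing) can be sketched in plain Python. This is only an illustration, not the maintainers' tooling: the sample records, the `question_id` and `category` fields, and the `to_internal_format` helper are hypothetical, and representing an empty reference as an empty list is an assumption.

```python
from typing import Any


def to_internal_format(record: dict[str, Any]) -> dict[str, Any]:
    """Rename `turns` to `prompts` and make sure a `reference` field exists."""
    # Copy every field except `turns`, which is re-added under its new name.
    converted = {key: value for key, value in record.items() if key != "turns"}
    converted["prompts"] = list(record["turns"])
    # Assumption: an "empty reference" is represented as an empty list.
    converted.setdefault("reference", [])
    return converted


# Hypothetical sample records, not taken from the dataset.
samples = [
    {"question_id": 1, "category": "writing", "turns": ["First prompt", "Follow-up"]},
    {"question_id": 2, "category": "math", "turns": ["Solve x + 1 = 2"], "reference": ["x = 1"]},
]

converted = [to_internal_format(record) for record in samples]
for row in converted:
    print(row)
```

    Records that already carry a reference keep it unchanged; only the missing ones receive the empty placeholder.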

  3. MultiTurn-Chat-MT-Bench

    • huggingface.co
    Updated Dec 19, 2024
    Cite
    AI Singapore (2024). MultiTurn-Chat-MT-Bench [Dataset]. https://huggingface.co/datasets/aisingapore/MultiTurn-Chat-MT-Bench
    Dataset updated
    Dec 19, 2024
    Dataset authored and provided by
    AI Singapore
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    SEA-MTBench

    SEA-MTBench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use gpt-4-1106-preview as the judge model and compare against gpt-3.5-turbo-0125 as the baseline model. It is based on MT-Bench and was manually translated by native speakers for Indonesian (id), Javanese (jv), Sundanese (su), and Vietnamese (vi). The Thai split of this dataset uses MT-Bench Thai from the ThaiLLM leaderboard.… See the full description on the dataset page: https://huggingface.co/datasets/aisingapore/MultiTurn-Chat-MT-Bench.

  4. mt_bench_human_judgments

    • huggingface.co
    Updated Apr 23, 2024
    + more versions
    Cite
    Large Model Systems Organization (2024). mt_bench_human_judgments [Dataset]. https://huggingface.co/datasets/lmsys/mt_bench_human_judgments
    Explore at: Croissant
    Dataset updated
    Apr 23, 2024
    Dataset authored and provided by
    Large Model Systems Organization
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Content

    This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions. The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions. The details of data collection can be found in our paper.

      Agreement Calculation
    

    This Colab notebook shows how to compute the… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/mt_bench_human_judgments.
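
    As a rough illustration of what an agreement calculation over pairwise preferences can look like, the sketch below computes the fraction of comparisons on which two judges pick the same winner. It is not the notebook's code: the vote labels, the sample votes, and the `agreement_rate` helper are all hypothetical.

```python
from typing import Sequence


def agreement_rate(votes_a: Sequence[str], votes_b: Sequence[str]) -> float:
    """Fraction of comparisons where two judges chose the same winner."""
    if len(votes_a) != len(votes_b):
        raise ValueError("both judges must vote on the same comparisons")
    if not votes_a:
        raise ValueError("at least one comparison is required")
    matches = sum(a == b for a, b in zip(votes_a, votes_b))
    return matches / len(votes_a)


# Hypothetical votes on four pairwise comparisons.
human_votes = ["model_a", "model_b", "model_a", "tie"]
judge_votes = ["model_a", "model_b", "model_b", "tie"]

rate = agreement_rate(human_votes, judge_votes)
print(f"agreement: {rate:.2f}")
```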

  5. mtbench-annotated-latest

    • huggingface.co
    Updated Oct 6, 2024
    Cite
    bay-calibration-llm-evaluators (2024). mtbench-annotated-latest [Dataset]. https://huggingface.co/datasets/bay-calibration-llm-evaluators/mtbench-annotated-latest
    Explore at: Croissant
    Dataset updated
    Oct 6, 2024
    Dataset authored and provided by
    bay-calibration-llm-evaluators
    Description

    MT-Bench-Select Dataset

      Introduction
    

    The MT-Bench-Select dataset is a refined subset of the original MT-Bench dataset introduced by Zheng et al. (2023). The original MT-Bench dataset comprises 80 questions with answers generated by six models. Each question and each pair of models form an evaluation task, resulting in 1,200 tasks. For this dataset, we used a curated subset of the original MT-Bench dataset, as prepared by the authors of the LLMBar paper (Zeng et al.… See the full description on the dataset page: https://huggingface.co/datasets/bay-calibration-llm-evaluators/mtbench-annotated-latest.
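
    The task count above follows from pairing the six models: there are C(6, 2) = 15 unordered model pairs, and 80 questions × 15 pairs = 1,200 tasks. A small check of that arithmetic, using placeholder model names and question ids rather than the dataset's own identifiers:

```python
from itertools import combinations
from math import comb

NUM_QUESTIONS = 80
# Placeholder names; the dataset's actual model identifiers may differ.
models = [f"model_{i}" for i in range(1, 7)]

# Every unordered pair of distinct models.
model_pairs = list(combinations(models, 2))

# One evaluation task per (question, model pair).
tasks = [
    (question_id, first, second)
    for question_id in range(1, NUM_QUESTIONS + 1)
    for first, second in model_pairs
]

assert len(model_pairs) == comb(6, 2) == 15
print(len(tasks))
```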

  6. MT-Bench-ZH

    • huggingface.co
    Updated Jan 9, 2024
    Cite
    Chen Zhang (2024). MT-Bench-ZH [Dataset]. https://huggingface.co/datasets/GeneZC/MT-Bench-ZH
    Explore at: Croissant
    Dataset updated
    Jan 9, 2024
    Authors
    Chen Zhang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    💬 MT-Bench-ZH

    👻 GitHub

      🎯 Motivation
    

    MiniChat-1/1.5/2-3B are all instruction-following language models that can handle Chinese instructions. However, there is currently no instruction-following benchmark specialized for Chinese, so our previous evaluation has been limited to English-only benchmarks (i.e., AlpacaEval and MT-Bench). MT-Bench-ZH was created to fill this gap. MT-Bench-ZH is basically translated from MT-Bench by GPT-4 and further… See the full description on the dataset page: https://huggingface.co/datasets/GeneZC/MT-Bench-ZH.

  7. MTBench-German

    • huggingface.co
    Updated May 4, 2025
    Cite
    Aleph Alpha (2025). MTBench-German [Dataset]. https://huggingface.co/datasets/Aleph-Alpha/MTBench-German
    Dataset updated
    May 4, 2025
    Dataset authored and provided by
    Aleph Alpha (https://aleph-alpha.com/)
    Description

    MTBench-German

    This dataset provides a German version of MTBench. We provide patches on top of VAGOsolutions/MT-Bench-TrueGerman, correcting minor errors in some of the user prompts (for example, asking the model for gender-neutral pronouns, which do not exist in the German language), and adding compatibility for the current version of the MTBench GitHub repo. To use this dataset, run the patch_script.sh script. This will download the datasets from VAGOsolutions/MT-Bench-TrueGerman… See the full description on the dataset page: https://huggingface.co/datasets/Aleph-Alpha/MTBench-German.

  8. mt-bench-thai

    • huggingface.co
    Updated Oct 8, 2024
    Cite
    ThaiLLM Leaderboard (2024). mt-bench-thai [Dataset]. https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai
    Explore at: Croissant
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    ThaiLLM Leaderboard
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MT-Bench Thai

    MT-Bench Thai is a dataset for multi-turn benchmarking that covers 9 categories:

    Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, Social Science, Knowledge III

    We introduce the final category, Knowledge III, which evaluates understanding of Thai cultural context.

      Dataset Loading
    

    from datasets import load_dataset
    ds = load_dataset("ThaiLLM-Leaderboard/mt-bench-thai")
    print(ds)

    output DatasetDict({ train: Dataset({ features: ['question_id'… See the full description on the dataset page: https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai.

  9. Long-MT-Bench-Plus

    • huggingface.co
    Updated Apr 9, 2025
    Cite
    Zhuoshi Pan (2025). Long-MT-Bench-Plus [Dataset]. https://huggingface.co/datasets/panzs19/Long-MT-Bench-Plus
    Dataset updated
    Apr 9, 2025
    Authors
    Zhuoshi Pan
    Description

    Long-MT-Bench+

    Long-MT-Bench+ is reconstructed from MT-Bench+ [1] and is more challenging for long-term conversations.

    [1] Junru Lu et al. MemoChat: Tuning LLMs to use memos for consistent long-range open-domain conversation. 2023.

      Dataset Description
    

    Building on MT-Bench+, we use the human-written questions in MT-Bench+ as few-shot examples and ask GPT-4 to generate a long-range test question for each dialogue. Following [2], we merge five consecutive sessions… See the full description on the dataset page: https://huggingface.co/datasets/panzs19/Long-MT-Bench-Plus.

  10. MTBench

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    QingyuShi (2025). MTBench [Dataset]. https://huggingface.co/datasets/QingyuShi/MTBench
    Dataset updated
    Feb 20, 2025
    Authors
    QingyuShi
    Description

    QingyuShi/MTBench dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. MT-Bench-TrueGerman

    • huggingface.co
    Updated May 7, 2024
    Cite
    VAGO solutions (2024). MT-Bench-TrueGerman [Dataset]. https://huggingface.co/datasets/VAGOsolutions/MT-Bench-TrueGerman
    Explore at: Croissant
    Dataset updated
    May 7, 2024
    Dataset authored and provided by
    VAGO solutions
    Description

    Benchmark

    German Benchmarks on Hugging Face

    At present, there is a notable scarcity, if not a complete absence, of reliable and true German benchmarks designed to evaluate the capabilities of German Language Models (LLMs). While some efforts have been made to translate English benchmarks into German, these attempts often fall short in terms of precision, accuracy, and context sensitivity, even when employing GPT-4 technology. Take, for instance, the MT-Bench, a widely recognized and… See the full description on the dataset page: https://huggingface.co/datasets/VAGOsolutions/MT-Bench-TrueGerman.

  12. MTBench-Human

    • huggingface.co
    Updated Jul 4, 2023
    Cite
    Chance Kim (2023). MTBench-Human [Dataset]. https://huggingface.co/datasets/tiratano/MTBench-Human
    Dataset updated
    Jul 4, 2023
    Authors
    Chance Kim
    Description

    tiratano/MTBench-Human dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. MTBench-Eval

    • huggingface.co
    Cite
    Chance Kim, MTBench-Eval [Dataset]. https://huggingface.co/datasets/tiratano/MTBench-Eval
    Explore at: Croissant
    Authors
    Chance Kim
    Description

    tiratano/MTBench-Eval dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. smugri-mt-bench

    • huggingface.co
    • dl.aifasthub.com
    Updated Aug 30, 2025
    Cite
    TartuNLP (2025). smugri-mt-bench [Dataset]. https://huggingface.co/datasets/tartuNLP/smugri-mt-bench
    Dataset updated
    Aug 30, 2025
    Dataset authored and provided by
    TartuNLP
    Description

    SMUGRI-MT-Bench

    This is a Finno-Ugric version of MT-Bench created to evaluate the multi-turn conversation and instruction-following capabilities of LLMs. It covers 4 (extremely) low-resource Finno-Ugric languages: Estonian, Livonian, Komi and Võro. SMUGRI-MT-Bench comprises 80 single and multi-turn questions organized into four topics: math, reasoning, writing, and general. The questions are handpicked from the LMSYS-Chat-1M dataset and manually translated into Estonian, Võro, Komi… See the full description on the dataset page: https://huggingface.co/datasets/tartuNLP/smugri-mt-bench.

  15. mt-bench-tw

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    ZoneTwelve (2025). mt-bench-tw [Dataset]. https://huggingface.co/datasets/ZoneTwelve/mt-bench-tw
    Explore at: Croissant
    Dataset updated
    Feb 19, 2025
    Authors
    ZoneTwelve
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MT-Bench Dataset

    The MT-Bench dataset is a collection of challenging multi-turn, open-ended questions designed to evaluate chat assistants and language models. Using LLM-as-a-judge, this dataset leverages strong models like GPT-4 to assess response quality and provide automated grading. This README provides details on using and extending the dataset for evaluation purposes.

      Introduction
    

    There has been a proliferation of LLM-based chat assistants (chatbots) that leverage… See the full description on the dataset page: https://huggingface.co/datasets/ZoneTwelve/mt-bench-tw.

  16. ja-mt-bench-1shot

    • huggingface.co
    Updated Sep 6, 2024
    Cite
    Shisa.AI (2024). ja-mt-bench-1shot [Dataset]. https://huggingface.co/datasets/shisa-ai/ja-mt-bench-1shot
    Explore at: Croissant
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    Shisa.AI
    Description

    shisa-ai/ja-mt-bench-1shot dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. MT-Bench

    • huggingface.co
    Updated Aug 16, 2025
    + more versions
    Cite
    David Anugraha (2025). MT-Bench [Dataset]. https://huggingface.co/datasets/davidanugraha/MT-Bench
    Dataset updated
    Aug 16, 2025
    Authors
    David Anugraha
    Description

    davidanugraha/MT-Bench dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. mtbench-dpo-turn1-human

    • huggingface.co
    Updated Jun 10, 2025
    Cite
    KyuheeKim (2025). mtbench-dpo-turn1-human [Dataset]. https://huggingface.co/datasets/koreankiwi99/mtbench-dpo-turn1-human
    Dataset updated
    Jun 10, 2025
    Authors
    KyuheeKim
    Description

    koreankiwi99/mtbench-dpo-turn1-human dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. mt-bench-eval-critique

    • huggingface.co
    Updated Apr 8, 2024
    Cite
    distilabel-internal-testing (2024). mt-bench-eval-critique [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique
    Explore at: Croissant
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    This dataset is used to check criticon prompts/responses while testing. It contains instructions/responses from mt_bench_eval, as extracted from: https://github.com/kaistAI/prometheus/blob/main/evaluation/benchmark/data/mt_bench_eval.json

    The dataset has been obtained by cleaning the data with:

    import re
    import pandas as pd
    from datasets import Dataset

    df = pd.read_json("mt_bench_eval.json", lines=True)

    ds = Dataset.from_pandas(df, preserve_index=False)… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/mt-bench-eval-critique.

  20. glider-mt-bench-suite

    • huggingface.co
    Updated Dec 19, 2024
    + more versions
    Cite
    Patronus AI (2024). glider-mt-bench-suite [Dataset]. https://huggingface.co/datasets/PatronusAI/glider-mt-bench-suite
    Explore at: Croissant
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Patronus AI, Inc.
    Authors
    Patronus AI
    Description

    PatronusAI/glider-mt-bench-suite dataset hosted on Hugging Face and contributed by the HF Datasets community
