8 datasets found
  1. Arena-Hard-Auto Dataset

    • paperswithcode.com
    Updated Jun 24, 2025
    Cite
    Tianle Li; Wei-Lin Chiang; Evan Frick; Lisa Dunlap; Tianhao Wu; Banghua Zhu; Joseph E. Gonzalez; Ion Stoica (2025). Arena-Hard-Auto Dataset [Dataset]. https://paperswithcode.com/dataset/arena-hard-auto
    Explore at:
    66 scholarly articles cite this dataset (view in Google Scholar)
    Dataset updated
    Jun 24, 2025
    Authors
    Tianle Li; Wei-Lin Chiang; Evan Frick; Lisa Dunlap; Tianhao Wu; Banghua Zhu; Joseph E. Gonzalez; Ion Stoica
    Description

    The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned Large Language Models (LLMs)¹. It was developed to provide a cheaper and faster approximation to human preference¹.

    Key features of the Arena-Hard-Auto benchmark:
    - It contains 500 challenging user queries¹.
    - It uses GPT-4-Turbo as a judge to compare the models' responses against a baseline model (default: GPT-4-0314)¹.
    - It employs an automatic judge as a cheaper and faster approximation to human preference¹.
    - It has the highest correlation and separability to Chatbot Arena among popular open-ended LLM benchmarks¹.
    - If you are curious to see how well your model might perform on Chatbot Arena, Arena-Hard-Auto is the recommended benchmark¹.

    The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluation⁴. Compared with other popular benchmarks, it offers significantly stronger separability and tighter confidence intervals².
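
    To make the judging setup above concrete, here is a minimal sketch of a single pairwise comparison in the Arena-Hard-Auto style: a judge model is asked whether a candidate answer or the baseline answer better addresses a query. The prompt wording, helper name, and judge settings are illustrative assumptions; the official judging templates and pipeline live in the arena-hard-auto repository.

    ```python
    # Minimal sketch of a pairwise LLM-as-judge comparison in the spirit of
    # Arena-Hard-Auto. Prompt text, helper name, and settings are illustrative
    # assumptions; see the arena-hard-auto repo for the actual pipeline.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def judge_pair(question: str, baseline_answer: str, candidate_answer: str) -> str:
        """Ask a judge model which of two answers better addresses the question."""
        prompt = (
            f"Question:\n{question}\n\n"
            f"Answer A (baseline):\n{baseline_answer}\n\n"
            f"Answer B (candidate):\n{candidate_answer}\n\n"
            "Which answer is better? Reply with exactly 'A', 'B', or 'tie'."
        )
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # judge model; the benchmark's default may differ
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()
    ```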

    (1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto
    (2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/
    (3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline. https://lmsys.org/blog/2024-04-19-arena-hard/
    (4) GitHub - lm-sys/arena-hard-auto (same as 1). https://github.com/lm-sys/arena-hard-auto
    (5) Arena-Hard: an open-source, high-quality LLM evaluation benchmark (CSDN blog). https://blog.csdn.net/weixin_57291105/article/details/138132998

  2. arena-hard-auto

    • huggingface.co
    Updated Apr 24, 2025
    Cite
    LMArena (2025). arena-hard-auto [Dataset]. https://huggingface.co/datasets/lmarena-ai/arena-hard-auto
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    LMArena
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Arena-Hard-Auto

    Repository for storing pre-generated model answers and judgments for:
    - Arena-Hard-v0.1
    - Arena-Hard-v2.0-Preview

    Repo -> https://github.com/lmarena/arena-hard-auto
    Paper -> https://arxiv.org/abs/2406.11939

    Citation

    The code in this repository is developed from the papers below. Please cite them if you find the repository helpful. @article{li2024crowdsourced, title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline}… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-hard-auto.
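
    For readers who want the stored answers and judgments locally, below is a minimal sketch using the Hugging Face Hub client to download the repository snapshot and list its files. The internal file layout is not described here, so treat it as something to inspect rather than assume.

    ```python
    # Minimal sketch: download the arena-hard-auto dataset repository and list
    # its contents. The internal file layout is not documented here -- inspect
    # the snapshot to locate the answer and judgment files you need.
    from pathlib import Path
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="lmarena-ai/arena-hard-auto",
        repo_type="dataset",
    )

    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            print(path.relative_to(local_dir))
    ```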

  3. ko-arena-hard-auto-v0.1

    • huggingface.co
    Updated Dec 10, 2024
    Cite
    Junjae Lee (2024). ko-arena-hard-auto-v0.1 [Dataset]. https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1
    Dataset updated
    Dec 10, 2024
    Authors
    Junjae Lee
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Ko-Arena-Hard-Auto

    Korean / English | Leaderboard / Code
    ko-arena-hard-auto-v0.1 is a question dataset for an automatic evaluation tool for benchmarking Korean-language ability. It was created by translating arena-hard-auto-v0.1, a benchmark dataset with high correlation to human preference and high separability, into Korean using GPT-4o and o1, followed by manual review. For further details and benchmarking results, see the ko-arena-hard-auto code; if you are interested in the original benchmark, see the arena-hard-auto code. Some questions were modified because the original format was difficult to preserve (indices: 1, 28, 29). The format of some questions was changed to elicit answers in Korean; the originals contain only code (indices: 30, 379, 190). References: @article{li2024crowdsourced, title={From Crowdsourced… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1.

  4. sos-artifacts

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Cite
    DataShare (2025). sos-artifacts [Dataset]. https://huggingface.co/datasets/DataShare/sos-artifacts
    Dataset updated
    Apr 26, 2025
    Dataset authored and provided by
    DataShare
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This repository contains a range of Arena-Hard-Auto benchmark artifacts sourced as part of the 2024 paper Style Outweighs Substance.

    Repository Structure
    Model responses for Arena-Hard-Auto questions: data/ArenaHardAuto/model_answer

    Our standard reference model for pairwise comparisons was gpt-4-0314.

    Our standard set of comparison models was:

    Llama-3-8B Variants: bagel-8b-v1.0, Llama-3-8B-Magpie-Align-SFT-v0.2, Llama-3-8B-Magpie-Align-v0.2, Llama-3-8B-Tulu-330K, Llama-3-8B-WildChat… See the full description on the dataset page: https://huggingface.co/datasets/DataShare/sos-artifacts.
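
    As a rough sketch of how one might read the model answers described above, the snippet below downloads the repository and iterates over files under data/ArenaHardAuto/model_answer, assuming the usual Arena-Hard-Auto JSON-lines layout (one record per question). Both the file extension and the record format are assumptions to verify against the actual artifacts.

    ```python
    # Sketch: read per-model answer files from the sos-artifacts layout above.
    # Assumes JSON-lines files under data/ArenaHardAuto/model_answer/ with one
    # record per question -- the extension and record format are assumptions.
    import json
    from pathlib import Path
    from huggingface_hub import snapshot_download

    repo_dir = Path(snapshot_download(repo_id="DataShare/sos-artifacts", repo_type="dataset"))
    answer_dir = repo_dir / "data" / "ArenaHardAuto" / "model_answer"

    for answer_file in sorted(answer_dir.glob("*.jsonl")):
        with answer_file.open() as f:
            records = [json.loads(line) for line in f if line.strip()]
        print(f"{answer_file.name}: {len(records)} answers")
    ```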

  5. justrank_judge_scores

    • huggingface.co
    Updated Sep 15, 2024
    Cite
    IBM Research (2024). justrank_judge_scores [Dataset]. https://huggingface.co/datasets/ibm-research/justrank_judge_scores
    Dataset updated
    Sep 15, 2024
    Dataset provided by
    IBM Research
    IBM (http://ibm.com/)
    Authors
    IBM Research
    License

    CDLA Permissive 2.0 (https://choosealicense.com/licenses/cdla-permissive-2.0/)

    Description

    🧑🏻‍⚖️ JuStRank Judge Scores

    A dataset of quality scores given by LLM judges and reward models for the outputs of many systems over the Arena Hard v0.1 benchmark. These judgment scores are the raw data we collected for the JuStRank paper (ACL 2025) on system-level LLM judge performance and behavior, and were also used to create the JuStRank Leaderboard. In our research we tested 10 LLM judges and 8 reward models, and asked them to score the responses of 63 systems (generative… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/justrank_judge_scores.
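
    A minimal sketch of how such raw judge scores might be rolled up into system-level rankings is shown below. The split name and column names ("judge", "system", "score") are assumptions about the schema, so check the dataset card before running; this is not the JuStRank paper's exact aggregation procedure.

    ```python
    # Sketch: aggregate per-response judge scores into a system-level ranking.
    # The split name and column names ("judge", "system", "score") are assumed;
    # verify them against the dataset card before running.
    import pandas as pd
    from datasets import load_dataset

    ds = load_dataset("ibm-research/justrank_judge_scores", split="train")
    df = ds.to_pandas()

    # Mean score each judge assigns to each system, then a per-judge ranking.
    system_scores = df.groupby(["judge", "system"])["score"].mean().reset_index()
    system_scores["rank"] = system_scores.groupby("judge")["score"].rank(ascending=False)
    print(system_scores.sort_values(["judge", "rank"]).head())
    ```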

  6. RocketEval-sLLMs

    • huggingface.co
    Updated Apr 7, 2025
    + more versions
    Cite
    Wonjin Kim (2025). RocketEval-sLLMs [Dataset]. https://huggingface.co/datasets/wjkim9653/RocketEval-sLLMs
    Dataset updated
    Apr 7, 2025
    Authors
    Wonjin Kim
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🚀 RocketEval 🚀

    🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

    Github | OpenReview | Colab

    This dataset contains the queries, generated checklist data, and responses data from 4 public benchmark datasets:

    Dataset    | No. of Queries | Comments
    MT-Bench   | 160            | Each 2-turn dialogue is split into 2 queries.
    AlpacaEval | 805            |
    Arena-Hard | 500            |
    WildBench  | 1,000          | To fit the context window of lightweight LLMs, we use a subset of WildBench including 1000…

    See the full description on the dataset page: https://huggingface.co/datasets/wjkim9653/RocketEval-sLLMs.
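
    To make the checklist-grading idea concrete, here is a small sketch in which a response's score is the fraction of checklist items a judge marks as satisfied. This is a simplified illustration of the general approach, not RocketEval's exact scoring scheme; the toy keyword judge stands in for the lightweight LLM judges the paper uses.

    ```python
    # Simplified illustration of checklist-based grading: a response scores the
    # fraction of checklist items judged as satisfied. RocketEval's actual
    # pipeline (lightweight LLM judges, calibration, etc.) is more involved.
    from typing import Callable

    def checklist_score(response: str, checklist: list[str],
                        judge: Callable[[str, str], bool]) -> float:
        """Return the fraction of checklist items the judge says are satisfied."""
        if not checklist:
            return 0.0
        satisfied = sum(1 for item in checklist if judge(response, item))
        return satisfied / len(checklist)

    # Toy keyword "judge" used only to make the sketch runnable; a real setup
    # would query a lightweight LLM for each (response, checklist item) pair.
    def toy_judge(response: str, item: str) -> bool:
        return item.lower() in response.lower()

    score = checklist_score(
        "The function sorts the list in place and returns None.",
        ["sorts the list", "returns None", "handles empty input"],
        toy_judge,
    )
    print(f"checklist score: {score:.2f}")  # ~0.67 with the toy judge
    ```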

  7. MixEval

    • huggingface.co
    Updated Jun 1, 2024
    Cite
    MixEval (2024). MixEval [Dataset]. https://huggingface.co/datasets/MixEval/MixEval
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 1, 2024
    Dataset authored and provided by
    MixEval
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    🏠 Homepage | 👨‍💻 Github | 🏆 Leaderboard | 📜 arXiv | 📝 blog | 🤗 HF Paper | 𝕏 Twitter

    Benchmark correlations (%) with Chatbot Arena Elo, against the total costs of evaluating a single GPT-3.5-Turbo-0125 model. MixEval and MixEval-Hard show the highest correlations with Arena Elo and Arena Elo (En) among leading benchmarks. We reference the crowdsourcing price for Amazon Mechanical Turk ($0.05 per vote) when estimating the cost of evaluating a single model on Chatbot Arena… See the full description on the dataset page: https://huggingface.co/datasets/MixEval/MixEval.
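
    As a concrete sketch of the correlation measurement mentioned above, the snippet below computes a rank correlation between per-model benchmark scores and Chatbot Arena Elo ratings. The numbers are placeholders for illustration only (not MixEval's reported results), and Spearman is just one reasonable choice of correlation statistic.

    ```python
    # Sketch: correlate per-model benchmark scores with Chatbot Arena Elo using
    # Spearman rank correlation. The values are placeholders for illustration
    # only, not MixEval's reported numbers.
    from scipy.stats import spearmanr

    arena_elo = {"model_a": 1250, "model_b": 1180, "model_c": 1105, "model_d": 1010}
    benchmark_score = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.69, "model_d": 0.55}

    models = sorted(arena_elo)
    rho, p_value = spearmanr(
        [arena_elo[m] for m in models],
        [benchmark_score[m] for m in models],
    )
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
    ```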

  8. deja-vu-pairwise-evals

    • huggingface.co
    Updated Apr 17, 2025
    Cite
    Cohere Labs (2025). deja-vu-pairwise-evals [Dataset]. https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Cohere Labs
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"

      Content
    

    This data contains pairwise automatic win-rate evaluations for 2 benchmarks.

    Outputs and judge decisions for the m-ArenaHard benchmark for sampled generations (5 each) from Aya Expanse 8B and Qwen2.5 7B Instruct. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals.
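
    A minimal sketch of turning such pairwise judge decisions into a win rate is shown below; the record fields ("winner" taking the values "model_a", "model_b", or "tie") are an assumed layout for illustration, not the documented schema of this dataset.

    ```python
    # Sketch: compute a win rate for one side from pairwise judge decisions.
    # The "winner" field and its values are an assumed layout for illustration,
    # not the documented schema of deja-vu-pairwise-evals.
    def win_rate(decisions: list[dict], model: str = "model_a") -> float:
        """Ties count as half a win, a common convention for pairwise win rates."""
        if not decisions:
            return 0.0
        wins = sum(
            1.0 if d["winner"] == model else 0.5 if d["winner"] == "tie" else 0.0
            for d in decisions
        )
        return wins / len(decisions)

    example_decisions = [
        {"prompt_id": 1, "winner": "model_a"},
        {"prompt_id": 2, "winner": "tie"},
        {"prompt_id": 3, "winner": "model_b"},
    ]
    print(f"win rate: {win_rate(example_decisions):.2f}")  # 0.50
    ```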
