The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned Large Language Models (LLMs)¹. It was developed to provide a cheaper and faster approximation of human preference¹.
Key features of the Arena-Hard-Auto benchmark:
- It contains 500 challenging user queries¹.
- It uses GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314)¹.
- The automatic judge serves as a cheaper and faster approximation of human preference¹.
- Among popular open-ended LLM benchmarks, it has the highest correlation with Chatbot Arena and the strongest model separability¹.
- If you are curious how well your model might perform on Chatbot Arena, Arena-Hard-Auto is the recommended proxy¹.
The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluation⁴. Compared with other benchmarks, it offers significantly stronger separability between models, with tighter confidence intervals².
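As a rough sketch of the pairwise setup described above (the verdict labels and data layout here are illustrative, not the benchmark's actual schema, and the real pipeline also applies statistical modeling to produce the confidence intervals mentioned above), a win rate against the baseline can be computed from per-query judge verdicts:

```python
from collections import Counter

def win_rate(verdicts):
    """Simple win rate for a candidate model from pairwise judge verdicts
    against the baseline (e.g. gpt-4-0314). `verdicts` holds one label per
    query: "win", "tie", or "loss" from the candidate's perspective; ties
    count as half a win. Illustrative only, not Arena-Hard-Auto's exact scoring.
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts["win"] + 0.5 * counts["tie"]) / total if total else 0.0

# Example: 300 wins, 50 ties, 150 losses over 500 queries -> 0.65
print(win_rate(["win"] * 300 + ["tie"] * 50 + ["loss"] * 150))
```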
(1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM .... https://github.com/lm-sys/arena-hard-auto
(2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/
(3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline .... https://lmsys.org/blog/2024-04-19-arena-hard/
(4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM .... https://github.com/lm-sys/arena-hard-auto
(5) Arena-Hard: An Open-Source High-Quality LLM Evaluation Benchmark (CSDN blog). https://blog.csdn.net/weixin_57291105/article/details/138132998
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arena-Hard-Auto
Repo for storing pre-generated model answers and judgments for Arena-Hard-v0.1 and Arena-Hard-v2.0-Preview.
Repo: https://github.com/lmarena/arena-hard-auto
Paper: https://arxiv.org/abs/2406.11939
Citation
The code in this repository was developed from the papers below. Please cite them if you find the repository helpful. @article{li2024crowdsourced, title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline}… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-hard-auto.
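A minimal sketch of fetching these pre-generated answers and judgments locally via the Hugging Face Hub; it only downloads the repository files and makes no assumption about their internal format:

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the dataset repository holding the pre-generated answers and judgments.
local_dir = snapshot_download(repo_id="lmarena-ai/arena-hard-auto",
                              repo_type="dataset")

# List a few of the downloaded files to see how the data is laid out.
files = sorted(p for p in Path(local_dir).rglob("*") if p.is_file())
for path in files[:10]:
    print(path.relative_to(local_dir))
```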
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ko-Arena-Hard-Auto
Korean / English | Leaderboard / Code
ko-arena-hard-auto-v0.1 is a question dataset for an automatic evaluation tool for benchmarking Korean. It was built by translating arena-hard-auto-v0.1, a benchmark dataset with high correlation to human preference and strong separability, into Korean using GPT-4o and o1, followed by manual review. For more details and benchmarking results, see the ko-arena-hard-auto code; if you are interested in the original benchmark, see the arena-hard-auto code. Because the original question format was difficult to preserve, it was changed for indices 1, 28, and 29. To steer the questions toward Korean, the question format was also changed for indices 30, 379, and 190; the originals consist of code only. References: @article{li2024crowdsourced, title={From Crowdsourced… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This repository contains a range of Arena-Hard-Auto benchmark artifacts sourced as part of the 2024 paper Style Outweighs Substance.
Repository Structure
Model responses for Arena-Hard-Auto questions: data/ArenaHardAuto/model_answer
Our standard reference model for pairwise comparisons was gpt-4-0314.
Our standard set of comparison models was:
Llama-3-8B Variants: bagel-8b-v1.0, Llama-3-8B-Magpie-Align-SFT-v0.2, Llama-3-8B-Magpie-Align-v0.2, Llama-3-8B-Tulu-330K, Llama-3-8B-WildChat… See the full description on the dataset page: https://huggingface.co/datasets/DataShare/sos-artifacts.
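A hedged sketch of loading the per-model answer files from the data/ArenaHardAuto/model_answer directory named above; the one-JSON-lines-file-per-model layout follows the usual Arena-Hard-Auto convention, but treat the exact file names and record fields as assumptions and inspect a record before relying on them:

```python
import json
from pathlib import Path

def load_model_answers(answer_dir="data/ArenaHardAuto/model_answer"):
    """Read one .jsonl answer file per model and return {model_name: records}.
    The layout and schema are assumed from the Arena-Hard-Auto convention,
    so check an actual record before depending on specific fields.
    """
    answers = {}
    for path in sorted(Path(answer_dir).glob("*.jsonl")):
        with path.open() as f:
            answers[path.stem] = [json.loads(line) for line in f if line.strip()]
    return answers

answers = load_model_answers()
print({model: len(records) for model, records in answers.items()})
```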
CDLA Permissive 2.0: https://choosealicense.com/licenses/cdla-permissive-2.0/
🧑🏻⚖️ JuStRank Judge Scores
A dataset of quality scores given by LLM judges and reward models for the outputs of many systems over the Arena Hard v0.1 benchmark. These judgment scores are the raw data we collected for the JuStRank paper from ACL 2025, on system-level LLM judge performance and behavior, and also used to create the JuStRank Leaderboard. In our research we tested 10 LLM judges and 8 reward models, and asked them to score the responses of 63 systems (generative… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/justrank_judge_scores.
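As a sketch of the system-level use of such judge scores (the system names, scores, and reference ranking below are made up, and this is not the paper's exact aggregation or metric), per-response scores can be averaged into a system ranking and compared against a reference ranking:

```python
from statistics import mean
from scipy.stats import kendalltau

def rank_systems(judge_scores):
    """Rank systems by the mean score a judge assigned to their responses.
    `judge_scores` maps system name -> list of per-response scores."""
    means = {name: mean(scores) for name, scores in judge_scores.items()}
    return sorted(means, key=means.get, reverse=True)

# Hypothetical scores for three systems; the real data covers 63 systems.
judge_scores = {"sys_a": [8, 7, 9], "sys_b": [6, 6, 7], "sys_c": [9, 9, 8]}
reference = ["sys_c", "sys_a", "sys_b"]      # e.g. a human-derived ranking
predicted = rank_systems(judge_scores)

# Kendall's tau between the judge-induced ranking and the reference one.
tau, _ = kendalltau([reference.index(s) for s in reference],
                    [predicted.index(s) for s in reference])
print(predicted, round(tau, 2))
```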
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚀 RocketEval 🚀
🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
This dataset contains the queries, generated checklist data, and response data from 4 public benchmark datasets:
Dataset / No. of Queries / Comments:
- MT-Bench: 160 (each 2-turn dialogue is split into 2 queries)
- AlpacaEval: 805
- Arena-Hard: 500
- WildBench: 1,000 (to fit the context window of lightweight LLMs, we use a subset of WildBench including 1000… See the full description on the dataset page: https://huggingface.co/datasets/wjkim9653/RocketEval-sLLMs.
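A minimal sketch of checklist-style grading, assuming a lightweight judge answers each checklist item with yes/no; this is a simplification for illustration, not RocketEval's exact scoring procedure:

```python
def checklist_score(item_passed, weights=None):
    """Score one response from per-item yes/no checklist judgments.
    `item_passed` is a list of booleans (one per checklist item);
    optional `weights` gives each item's importance. Simplified sketch,
    not RocketEval's actual aggregation.
    """
    if weights is None:
        weights = [1.0] * len(item_passed)
    return sum(w for ok, w in zip(item_passed, weights) if ok) / sum(weights)

# Example: a response satisfies 3 of 4 equally weighted checklist items.
print(checklist_score([True, True, False, True]))   # 0.75
```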
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Benchmark correlations (%) with Chatbot Arena Elo, plotted against the total cost of evaluating a single GPT-3.5-Turbo-0125 model. MixEval and MixEval-Hard show the highest correlations with Arena Elo and Arena Elo (En) among leading benchmarks. We reference the crowdsourcing price on Amazon Mechanical Turk ($0.05 per vote) when estimating the cost of evaluating a single model on Chatbot Arena… See the full description on the dataset page: https://huggingface.co/datasets/MixEval/MixEval.
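As an illustration of this kind of benchmark-to-Arena agreement (the numbers below are made up, and the metric the authors report may differ), a correlation between benchmark scores and Arena Elo ratings can be computed like this:

```python
from scipy.stats import spearmanr

# Hypothetical scores for four models on some benchmark, alongside their
# Chatbot Arena Elo ratings; real comparisons use many more models.
benchmark_scores = [62.1, 55.4, 71.8, 48.0]
arena_elo = [1180, 1120, 1250, 1075]

# Rank correlation between the benchmark ranking and the Arena Elo ranking.
rho, _ = spearmanr(benchmark_scores, arena_elo)
print(f"Spearman correlation: {rho:.2f}")
```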
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"
Content
This dataset contains pairwise automatic win-rate evaluations for two benchmarks.
Outputs and judge decisions for the m-ArenaHard benchmark for sampled generations (5 each) from Aya Expanse 8B and Qwen2.5 7B Instruct. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals.
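As a sketch of how such pairwise judge decisions can be summarized (the variant names and verdict labels are illustrative, not the dataset's actual fields), a win rate can be computed separately for original and roundtrip-translated prompts:

```python
from collections import defaultdict

def win_rates_by_variant(decisions):
    """Win rate of model A (e.g. Aya Expanse 8B) over model B per prompt
    variant. `decisions` is an iterable of (variant, verdict) pairs with
    verdict in {"A", "B", "tie"}; ties count as half a win. The variant
    names and labels are illustrative, not the dataset's schema.
    """
    wins, totals = defaultdict(float), defaultdict(int)
    for variant, verdict in decisions:
        totals[variant] += 1
        wins[variant] += 1.0 if verdict == "A" else 0.5 if verdict == "tie" else 0.0
    return {variant: wins[variant] / totals[variant] for variant in totals}

print(win_rates_by_variant([
    ("original", "A"), ("original", "B"),
    ("roundtrip-nllb", "A"), ("roundtrip-nllb", "tie"),
]))
# -> {'original': 0.5, 'roundtrip-nllb': 0.75}
```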