The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned Large Language Models (LLMs)¹. It was developed to provide a cheaper and faster approximation of human preference¹.
Key features of the Arena-Hard-Auto benchmark:
- It contains 500 challenging user queries¹.
- It uses GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314)¹.
- The automatic judge serves as a cheaper and faster approximation of human preference¹.
- Among popular open-ended LLM benchmarks, it has the highest correlation with Chatbot Arena and the strongest model separability¹.
- If you are curious how well your model might perform on Chatbot Arena, Arena-Hard-Auto is the recommended proxy¹.
The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluation⁴. Compared with other benchmarks, it offers significantly stronger separability between models, with tighter confidence intervals².
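As a rough sketch of the pairwise setup described above (the verdict labels and data layout here are illustrative, not the benchmark's actual schema, and the real pipeline also applies statistical modeling to produce the confidence intervals mentioned above), a win rate against the baseline can be computed from per-query judge verdicts:

```python
from collections import Counter

def win_rate(verdicts):
    """Simple win rate for a candidate model from pairwise judge verdicts
    against the baseline (e.g. gpt-4-0314). `verdicts` holds one label per
    query: "win", "tie", or "loss" from the candidate's perspective; ties
    count as half a win. Illustrative only, not Arena-Hard-Auto's exact scoring.
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts["win"] + 0.5 * counts["tie"]) / total if total else 0.0

# Example: 300 wins, 50 ties, 150 losses over 500 queries -> 0.65
print(win_rate(["win"] * 300 + ["tie"] * 50 + ["loss"] * 150))
```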
(1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM .... https://github.com/lm-sys/arena-hard-auto
(2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/
(3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline .... https://lmsys.org/blog/2024-04-19-arena-hard/
(4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM .... https://github.com/lm-sys/arena-hard-auto
(5) Arena-Hard: An Open-Source High-Quality LLM Evaluation Benchmark (CSDN blog). https://blog.csdn.net/weixin_57291105/article/details/138132998
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arena-Hard-Auto
Repo for storing pre-generated model answers and judgments for Arena-Hard-v0.1 and Arena-Hard-v2.0-Preview.
Repo: https://github.com/lmarena/arena-hard-auto
Paper: https://arxiv.org/abs/2406.11939
Citation
The code in this repository was developed from the papers below. Please cite them if you find the repository helpful. @article{li2024crowdsourced, title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline}… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-hard-auto.
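A minimal sketch of fetching these pre-generated answers and judgments locally via the Hugging Face Hub; it only downloads the repository files and makes no assumption about their internal format:

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the dataset repository holding the pre-generated answers and judgments.
local_dir = snapshot_download(repo_id="lmarena-ai/arena-hard-auto",
                              repo_type="dataset")

# List a few of the downloaded files to see how the data is laid out.
files = sorted(p for p in Path(local_dir).rglob("*") if p.is_file())
for path in files[:10]:
    print(path.relative_to(local_dir))
```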
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ko-Arena-Hard-Auto
Korean / English | Leaderboard / Code
ko-arena-hard-auto-v0.1 is a question dataset for an automatic evaluation tool for benchmarking Korean. It was built by translating arena-hard-auto-v0.1, a benchmark dataset with high correlation to human preference and strong separability, into Korean using GPT-4o and o1, followed by manual review. For more details and benchmarking results, see the ko-arena-hard-auto code; if you are interested in the original benchmark, see the arena-hard-auto code. Because the original question format was difficult to preserve, it was changed for indices 1, 28, and 29. To steer the questions toward Korean, the question format was also changed for indices 30, 379, and 190; the originals consist of code only. References: @article{li2024crowdsourced, title={From Crowdsourced… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This repository contains a range of Arena-Hard-Auto benchmark artifacts sourced as part of the 2024 paper Style Outweighs Substance.
Repository Structure
Model responses for Arena-Hard-Auto questions: data/ArenaHardAuto/model_answer
Our standard reference model for pairwise comparisons was gpt-4-0314.
Our standard set of comparison models was:
Llama-3-8B Variants: bagel-8b-v1.0, Llama-3-8B-Magpie-Align-SFT-v0.2, Llama-3-8B-Magpie-Align-v0.2, Llama-3-8B-Tulu-330K, Llama-3-8B-WildChat… See the full description on the dataset page: https://huggingface.co/datasets/DataShare/sos-artifacts.
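A hedged sketch of loading the per-model answer files from the data/ArenaHardAuto/model_answer directory named above; the one-JSON-lines-file-per-model layout follows the usual Arena-Hard-Auto convention, but treat the exact file names and record fields as assumptions and inspect a record before relying on them:

```python
import json
from pathlib import Path

def load_model_answers(answer_dir="data/ArenaHardAuto/model_answer"):
    """Read one .jsonl answer file per model and return {model_name: records}.
    The layout and schema are assumed from the Arena-Hard-Auto convention,
    so check an actual record before depending on specific fields.
    """
    answers = {}
    for path in sorted(Path(answer_dir).glob("*.jsonl")):
        with path.open() as f:
            answers[path.stem] = [json.loads(line) for line in f if line.strip()]
    return answers

answers = load_model_answers()
print({model: len(records) for model, records in answers.items()})
```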
CDLA Permissive 2.0: https://choosealicense.com/licenses/cdla-permissive-2.0/
🧑🏻⚖️ JuStRank Judge Scores
A dataset of quality scores given by LLM judges and reward models for the outputs of many systems over the Arena Hard v0.1 benchmark. These judgment scores are the raw data we collected for the JuStRank paper from ACL 2025, on system-level LLM judge performance and behavior, and also used to create the JuStRank Leaderboard. In our research we tested 10 LLM judges and 8 reward models, and asked them to score the responses of 63 systems (generative… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/justrank_judge_scores.
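As a sketch of the system-level use of such judge scores (the system names, scores, and reference ranking below are made up, and this is not the paper's exact aggregation or metric), per-response scores can be averaged into a system ranking and compared against a reference ranking:

```python
from statistics import mean
from scipy.stats import kendalltau

def rank_systems(judge_scores):
    """Rank systems by the mean score a judge assigned to their responses.
    `judge_scores` maps system name -> list of per-response scores."""
    means = {name: mean(scores) for name, scores in judge_scores.items()}
    return sorted(means, key=means.get, reverse=True)

# Hypothetical scores for three systems; the real data covers 63 systems.
judge_scores = {"sys_a": [8, 7, 9], "sys_b": [6, 6, 7], "sys_c": [9, 9, 8]}
reference = ["sys_c", "sys_a", "sys_b"]      # e.g. a human-derived ranking
predicted = rank_systems(judge_scores)

# Kendall's tau between the judge-induced ranking and the reference one.
tau, _ = kendalltau([reference.index(s) for s in reference],
                    [predicted.index(s) for s in reference])
print(predicted, round(tau, 2))
```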
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚀 RocketEval 🚀
🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
This dataset contains the queries, generated checklist data, and response data from 4 public benchmark datasets:
Dataset / No. of Queries / Comments:
- MT-Bench: 160 (each 2-turn dialogue is split into 2 queries)
- AlpacaEval: 805
- Arena-Hard: 500
- WildBench: 1,000 (to fit the context window of lightweight LLMs, we use a subset of WildBench including 1000… See the full description on the dataset page: https://huggingface.co/datasets/wjkim9653/RocketEval-sLLMs.
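A minimal sketch of checklist-style grading, assuming a lightweight judge answers each checklist item with yes/no; this is a simplification for illustration, not RocketEval's exact scoring procedure:

```python
def checklist_score(item_passed, weights=None):
    """Score one response from per-item yes/no checklist judgments.
    `item_passed` is a list of booleans (one per checklist item);
    optional `weights` gives each item's importance. Simplified sketch,
    not RocketEval's actual aggregation.
    """
    if weights is None:
        weights = [1.0] * len(item_passed)
    return sum(w for ok, w in zip(item_passed, weights) if ok) / sum(weights)

# Example: a response satisfies 3 of 4 equally weighted checklist items.
print(checklist_score([True, True, False, True]))   # 0.75
```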
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Benchmark correlations (%) with Chatbot Arena Elo, plotted against the total cost of evaluating a single GPT-3.5-Turbo-0125 model. MixEval and MixEval-Hard show the highest correlations with Arena Elo and Arena Elo (En) among leading benchmarks. We reference the crowdsourcing price on Amazon Mechanical Turk ($0.05 per vote) when estimating the cost of evaluating a single model on Chatbot Arena… See the full description on the dataset page: https://huggingface.co/datasets/MixEval/MixEval.
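As an illustration of this kind of benchmark-to-Arena agreement (the numbers below are made up, and the metric the authors report may differ), a correlation between benchmark scores and Arena Elo ratings can be computed like this:

```python
from scipy.stats import spearmanr

# Hypothetical scores for four models on some benchmark, alongside their
# Chatbot Arena Elo ratings; real comparisons use many more models.
benchmark_scores = [62.1, 55.4, 71.8, 48.0]
arena_elo = [1180, 1120, 1250, 1075]

# Rank correlation between the benchmark ranking and the Arena Elo ranking.
rho, _ = spearmanr(benchmark_scores, arena_elo)
print(f"Spearman correlation: {rho:.2f}")
```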
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"
Content
This dataset contains pairwise automatic win-rate evaluations for two benchmarks.
Outputs and judge decisions for the m-ArenaHard benchmark for sampled generations (5 each) from Aya Expanse 8B and Qwen2.5 7B Instruct. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals.
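As a sketch of how such pairwise judge decisions can be summarized (the variant names and verdict labels are illustrative, not the dataset's actual fields), a win rate can be computed separately for original and roundtrip-translated prompts:

```python
from collections import defaultdict

def win_rates_by_variant(decisions):
    """Win rate of model A (e.g. Aya Expanse 8B) over model B per prompt
    variant. `decisions` is an iterable of (variant, verdict) pairs with
    verdict in {"A", "B", "tie"}; ties count as half a win. The variant
    names and labels are illustrative, not the dataset's schema.
    """
    wins, totals = defaultdict(float), defaultdict(int)
    for variant, verdict in decisions:
        totals[variant] += 1
        wins[variant] += 1.0 if verdict == "A" else 0.5 if verdict == "tie" else 0.0
    return {variant: wins[variant] / totals[variant] for variant in totals}

print(win_rates_by_variant([
    ("original", "A"), ("original", "B"),
    ("roundtrip-nllb", "A"), ("roundtrip-nllb", "tie"),
]))
# -> {'original': 0.5, 'roundtrip-nllb': 0.75}
```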