Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for m-ArenaHard
Dataset Details
The m-ArenaHard dataset is a multilingual LLM evaluation set. It was created by translating the prompts of the originally English-only LMArena (formerly LMSYS) arena-hard-auto-v0.1 test dataset into 22 languages with Google Translate API v3. The original English-only prompts were created by Li et al. (2024) and consist of 500 challenging user queries sourced from Chatbot Arena. The authors show that these can be used… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/m-ArenaHard.
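For orientation, here is a minimal sketch of loading one language subset with the Hugging Face `datasets` library; the configuration name, split, and field layout are assumptions for illustration, so check the dataset page for the actual schema.

```python
from datasets import load_dataset

# Minimal sketch: load one translated language subset of m-ArenaHard.
# The config name ("deu_Latn") and split ("test") are assumptions for
# illustration; the dataset page lists the actual configurations.
prompts = load_dataset("CohereLabs/m-ArenaHard", "deu_Latn", split="test")
print(len(prompts))   # expected: 500 translated prompts
print(prompts[0])     # inspect the first record's fields
```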
The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned large language models (LLMs)¹. It was developed to provide a cheaper and faster approximation of human preference¹.
Here are some key features of the Arena-Hard-Auto benchmark:
- It contains 500 challenging user queries¹.
- It uses GPT-4-Turbo as a judge to compare the models' responses against a baseline model (default: GPT-4-0314)¹.
- It employs an automatic judge as a cheaper and faster approximation of human preference¹.
- It has the highest correlation and separability with Chatbot Arena among popular open-ended LLM benchmarks¹.
- If you are curious to see how well your model might perform on Chatbot Arena, Arena-Hard-Auto is recommended¹.
The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluations⁴. It offers significantly stronger separability than other benchmarks, with tighter confidence intervals².
(1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto.
(2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/.
(3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline …. https://lmsys.org/blog/2024-04-19-arena-hard/.
(4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto.
(5) Arena-Hard: an open-source, high-quality benchmark for evaluating large models (CSDN blog, in Chinese). https://blog.csdn.net/weixin_57291105/article/details/138132998.
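As a rough illustration of the pairwise-judgment scheme described above, the sketch below aggregates per-question verdicts (candidate vs. baseline) into a win rate; `judge_pair` is a hypothetical placeholder for a call to the judge model (GPT-4-Turbo in the benchmark), and the real pipeline lives in the arena-hard-auto repository.

```python
from typing import Callable, Dict, List


def win_rate(
    questions: List[str],
    candidate_answers: Dict[str, str],
    baseline_answers: Dict[str, str],
    judge_pair: Callable[[str, str, str], float],
) -> float:
    """Average judge score over all questions.

    judge_pair(question, candidate, baseline) is assumed to return
    1.0 if the candidate wins, 0.5 for a tie, and 0.0 if the baseline wins.
    """
    scores = [
        judge_pair(q, candidate_answers[q], baseline_answers[q])
        for q in questions
    ]
    return sum(scores) / len(scores)
```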
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arena-Hard-Auto
Repo for storing pre-generated model answers and judgments for Arena-Hard-v0.1 and Arena-Hard-v2.0-Preview.
Repo -> https://github.com/lmarena/arena-hard-auto
Paper -> https://arxiv.org/abs/2406.11939
Citation
The code in this repository was developed from the papers below. Please cite them if you find the repository helpful. @article{li2024crowdsourced, title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline}… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-hard-auto.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ko-Arena-Hard-Auto
Korean / English · Leaderboard / Code. ko-arena-hard-auto-v0.1 is a question dataset for an automatic evaluation tool for benchmarking Korean. It was built by translating arena-hard-auto-v0.1, a benchmark dataset with high correlation and separability with respect to human preference, into Korean using GPT-4o and o1, followed by manual review. For more details and benchmarking results, see the ko-arena-hard-auto code; if you are interested in the original benchmark, see the arena-hard-auto code. Some questions were modified: questions 1, 28, and 29 were changed because their original format was difficult to preserve, and questions 30, 379, and 190, which originally consisted only of code, had their format changed to elicit answers in Korean. References: @article{li2024crowdsourced, title={From Crowdsourced… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Check out our blog post
Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should:
1. robustly separate model capability,
2. reflect human preference in real-world use cases, and
3. be updated frequently to avoid over-fitting or test-set leakage.
Traditional benchmarks are often static or close-ended (e.g., MMLU multi-choice QA), which do not satisfy the above requirements. On the other hand, models are evolving faster than ever, underscoring the need to build benchmarks with high separability.
We introduce Arena-Hard – a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals.
We compare our new benchmark, Arena Hard v0.1, to a current leading chat LLM benchmark, MT Bench. We show that Arena Hard v0.1 offers significantly stronger separability than MT Bench, with tighter confidence intervals. It also has higher agreement (89.1%, see blog post) with the human preference ranking from Chatbot Arena (English-only). We expect this benchmark to be useful for model developers to differentiate their model checkpoints.
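To make the confidence-interval claim concrete, here is a minimal sketch (not the official scoring script) of bootstrapping a percentile interval for a model's win rate from per-question outcomes against the baseline; the outcomes in the example are fabricated purely for illustration.

```python
import random


def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-question outcomes
    (1 = win, 0.5 = tie, 0 = loss against the baseline)."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(outcomes) for _ in outcomes) / len(outcomes)
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Illustrative outcomes for ten questions (not real benchmark data).
print(bootstrap_ci([1, 1, 0.5, 0, 1, 1, 0, 0.5, 1, 1]))
```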
Dataset Card for m-ArenaHard-v2.0
This dataset is used in the paper When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs.
Dataset Details
The m-ArenaHard-v2.0 dataset is a multilingual LLM evaluation set built on the LMArena (formerly LMSYS) arena-hard-auto-v2.0 test dataset. That dataset (containing 750 prompts) was filtered to English-only prompts using the papluca/xlm-roberta-base-language-detection model, resulting… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/m-ArenaHard-v2.0.
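For reference, a minimal sketch of this kind of language filtering with the named model through a standard transformers text-classification pipeline is shown below; the example prompts and the idea of keeping only "en"-labelled items are illustrative, not the exact filtering script used to build the dataset.

```python
from transformers import pipeline

# Language detector named in the filtering step described above.
detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

prompts = [
    "Write a bash script that renames every .txt file in a folder.",
    "Écris un poème sur la mer.",
]

# Keep only prompts the detector labels as English ("en").
english_prompts = [
    p for p, pred in zip(prompts, detector(prompts)) if pred["label"] == "en"
]
print(english_prompts)
```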
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Greek m-ArenaHard
This is a Greek translation of LMArena's arena-hard-auto-v0.1, produced with Claude Sonnet 3.5 v2. This version originates from Cohere's m-ArenaHard, which was originally translated using Google Translate API v3. We curated the dataset further by using Claude Sonnet 3.5 v2 to post-edit the original Google Translate API v3 translations, as we noticed that some translated prompts (especially those related to coding) had not… See the full description on the dataset page: https://huggingface.co/datasets/ilsp/m-ArenaHard_greek.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ru-arena-hard
This is a translated version of the arena-hard-auto dataset for evaluating LLMs. The original dataset was translated manually. In addition, the content of each task was reviewed, and the correctness of the task statement and its compliance with moral and ethical standards were assessed. This dataset therefore allows you to evaluate how well language models support the Russian language.
Overview of the Dataset
Original dataset:… See the full description on the dataset page: https://huggingface.co/datasets/t-tech/ru-arena-hard.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for radm/arenahard_gpt4vsllama3
The dataset was created for fine-tuning Llama-3-70B-Instruct as a judge on Arena Hard (https://github.com/lm-sys/arena-hard-auto).
Dataset Info
question_id: question id from Arena Hard
instruction: original instruction from Arena Hard
model: model whose responses are evaluated against the baseline model (gpt-4-0314) - gpt-4-turbo-2024-04-09 (score: 82.6) and Llama-2-70b-chat-hf (score: 11.6)
input: responses of the evaluated… See the full description on the dataset page: https://huggingface.co/datasets/radm/arenahard_gpt4vsllama3.
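A minimal sketch of turning these fields into supervised examples for judge fine-tuning follows; the split name, the target column, and the prompt template are assumptions for illustration, since the card truncates the full column list.

```python
from datasets import load_dataset

# Split name is an assumption; check the dataset viewer for the real one.
ds = load_dataset("radm/arenahard_gpt4vsllama3", split="train")


def to_judge_example(row):
    # Illustrative prompt template, not the one used to train the judge.
    prompt = (
        f"Instruction:\n{row['instruction']}\n\n"
        f"Responses to compare:\n{row['input']}\n\n"
        "Which response is better?"
    )
    # "output" (the judgment) is an assumed column name.
    return {"prompt": prompt, "target": row.get("output", "")}


judge_examples = ds.map(to_judge_example)
```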
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Sources
Paper: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Link: https://huggingface.co/papers/2502.07346
Repository: https://github.com/CONE-MT/BenchMAX
Dataset Description
BenchMAX_Model-based is a BenchMAX dataset, sourced from m-ArenaHard, that evaluates instruction-following capability via model-based judgment. We extend the original dataset to include languages not supported by m-ArenaHard through… See the full description on the dataset page: https://huggingface.co/datasets/LLaMAX/BenchMAX_Model-based.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This repository contains a range of Arena-Hard-Auto benchmark artifacts sourced as part of the 2024 paper Style Outweighs Substance.
Repository Structure
Model responses for Arena-Hard-Auto questions: data/ArenaHardAuto/model_answer
Our standard reference model for pairwise comparisons was gpt-4-0314.
Our standard set of comparison models was:
Llama-3-8B Variants: bagel-8b-v1.0, Llama-3-8B-Magpie-Align-SFT-v0.2, Llama-3-8B-Magpie-Align-v0.2, Llama-3-8B-Tulu-330K, Llama-3-8B-WildChat… See the full description on the dataset page: https://huggingface.co/datasets/DataShare/sos-artifacts.
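A minimal sketch for loading the stored model answers is given below; it assumes one JSON-lines file per model under data/ArenaHardAuto/model_answer, following the upstream arena-hard-auto layout, so adjust paths and field names after inspecting the actual files.

```python
import json
from pathlib import Path

# Assumed layout: data/ArenaHardAuto/model_answer/<model-name>.jsonl
answers = {}
for path in Path("data/ArenaHardAuto/model_answer").glob("*.jsonl"):
    with path.open() as f:
        answers[path.stem] = [json.loads(line) for line in f]

# Number of stored answers per model.
print({model: len(rows) for model, rows in answers.items()})
```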
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset adds tags to qwopqwop/ko-arena-hard-auto-v0.1, the arena-hard data translated into Korean by qwopqwop.
Tag information
Category | Count | Description
Coding & Debugging | 279 | Users seek help with writing, reviewing, or fixing code in programming.
Planning | 67 | Users need assistance in creating plans or strategies for activities and projects.
Data analysis | 31 | Requests involve interpreting data, statistics, or performing analytical tasks.
Math | 26 | Queries related to mathematical concepts, problems, and… See the full description on the dataset page: https://huggingface.co/datasets/nwirandx/ko-arena-hard-auto-v0.1.
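A minimal sketch of loading the tagged dataset and grouping prompts by category is shown below; the split name and the tag column name ("category") are assumptions, so check the dataset viewer for the actual schema.

```python
from collections import Counter

from datasets import load_dataset

# Split and column names are assumptions for illustration.
ds = load_dataset("nwirandx/ko-arena-hard-auto-v0.1", split="train")

# Count prompts per tag, then keep only the coding questions.
print(Counter(ds["category"]))
coding = ds.filter(lambda row: row["category"] == "Coding & Debugging")
print(len(coding))
```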
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"
Content
This data contains pairwise automatic win-rate evaluations for 2 benchmarks.
Outputs and judge decisions for the m-ArenaHard benchmark for sampled generations (5 each) from Aya Expanse 8B and Qwen2.5 7B Instruct. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals.
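As a rough illustration of how such pairwise judge decisions can be aggregated, the sketch below computes per-prompt win rates for one model over its sampled generations; the record layout and model identifiers are fabricated for illustration and do not reflect the released file schema.

```python
from collections import defaultdict

# Fabricated judge decisions purely for illustration (one entry per
# sampled-generation comparison); inspect the released files for the
# actual schema.
decisions = [
    {"prompt_id": "q1", "winner": "aya-expanse-8b"},
    {"prompt_id": "q1", "winner": "qwen2.5-7b-instruct"},
    {"prompt_id": "q2", "winner": "aya-expanse-8b"},
]

wins = defaultdict(int)
totals = defaultdict(int)
for d in decisions:
    totals[d["prompt_id"]] += 1
    if d["winner"] == "aya-expanse-8b":
        wins[d["prompt_id"]] += 1

win_rate_per_prompt = {q: wins[q] / totals[q] for q in totals}
print(win_rate_per_prompt)
```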