Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for m-ArenaHard
Dataset Details
The m-ArenaHard dataset is a multilingual LLM evaluation set. It was created by translating the prompts of the originally English-only LMArena (formerly LMSYS) arena-hard-auto-v0.1 test dataset into 22 languages with Google Translate API v3. The original English-only prompts were created by Li et al. (2024) and consist of 500 challenging user queries sourced from Chatbot Arena. The authors show that these can be used… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/m-ArenaHard.
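For orientation, here is a minimal sketch of loading one language subset with the Hugging Face `datasets` library; the configuration name, split, and field layout are assumptions for illustration, so check the dataset page for the actual schema.

```python
from datasets import load_dataset

# Minimal sketch: load one translated language subset of m-ArenaHard.
# The config name ("deu_Latn") and split ("test") are assumptions for
# illustration; the dataset page lists the actual configurations.
prompts = load_dataset("CohereLabs/m-ArenaHard", "deu_Latn", split="test")
print(len(prompts))   # expected: 500 translated prompts
print(prompts[0])     # inspect the first record's fields
```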
The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned large language models (LLMs)¹. It was developed to provide a cheaper and faster approximation of human preference¹.
Here are some key features of the Arena-Hard-Auto benchmark:
- It contains 500 challenging user queries¹.
- It uses GPT-4-Turbo as a judge to compare the models' responses against a baseline model (default: GPT-4-0314)¹.
- It employs an automatic judge as a cheaper and faster approximation of human preference¹.
- It has the highest correlation and separability with Chatbot Arena among popular open-ended LLM benchmarks¹.
- If you are curious to see how well your model might perform on Chatbot Arena, Arena-Hard-Auto is recommended¹.
The benchmark is built from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluations⁴. It offers significantly stronger separability than other benchmarks, with tighter confidence intervals².
(1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto.
(2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/.
(3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline …. https://lmsys.org/blog/2024-04-19-arena-hard/.
(4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM …. https://github.com/lm-sys/arena-hard-auto.
(5) Arena-Hard: an open-source, high-quality benchmark for evaluating large models (CSDN blog, in Chinese). https://blog.csdn.net/weixin_57291105/article/details/138132998.
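As a rough illustration of the pairwise-judgment scheme described above, the sketch below aggregates per-question verdicts (candidate vs. baseline) into a win rate; `judge_pair` is a hypothetical placeholder for a call to the judge model (GPT-4-Turbo in the benchmark), and the real pipeline lives in the arena-hard-auto repository.

```python
from typing import Callable, Dict, List


def win_rate(
    questions: List[str],
    candidate_answers: Dict[str, str],
    baseline_answers: Dict[str, str],
    judge_pair: Callable[[str, str, str], float],
) -> float:
    """Average judge score over all questions.

    judge_pair(question, candidate, baseline) is assumed to return
    1.0 if the candidate wins, 0.5 for a tie, and 0.0 if the baseline wins.
    """
    scores = [
        judge_pair(q, candidate_answers[q], baseline_answers[q])
        for q in questions
    ]
    return sum(scores) / len(scores)
```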
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arena-Hard-Auto
Repo for storing pre-generated model answers and judgments for Arena-Hard-v0.1 and Arena-Hard-v2.0-Preview.
Repo -> https://github.com/lmarena/arena-hard-auto
Paper -> https://arxiv.org/abs/2406.11939
Citation
The code in this repository was developed from the papers below. Please cite them if you find the repository helpful. @article{li2024crowdsourced, title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline}… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-hard-auto.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ko-Arena-Hard-Auto
Korean / English · Leaderboard / Code. ko-arena-hard-auto-v0.1 is a question dataset for an automatic evaluation tool for benchmarking Korean. It was built by translating arena-hard-auto-v0.1, a benchmark dataset with high correlation and separability with respect to human preference, into Korean using GPT-4o and o1, followed by manual review. For more details and benchmarking results, see the ko-arena-hard-auto code; if you are interested in the original benchmark, see the arena-hard-auto code. Some questions were modified: questions 1, 28, and 29 were changed because their original format was difficult to preserve, and questions 30, 379, and 190, which originally consisted only of code, had their format changed to elicit answers in Korean. References: @article{li2024crowdsourced, title={From Crowdsourced… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Check out our blog post
Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should:
1. robustly separate model capability,
2. reflect human preference in real-world use cases, and
3. be updated frequently to avoid over-fitting or test-set leakage.
Traditional benchmarks are often static or close-ended (e.g., MMLU multi-choice QA), which do not satisfy the above requirements. On the other hand, models are evolving faster than ever, underscoring the need to build benchmarks with high separability.
We introduce Arena-Hard – a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals.
We compare our new benchmark, Arena Hard v0.1, to a current leading chat LLM benchmark, MT Bench. We show that Arena Hard v0.1 offers significantly stronger separability than MT Bench, with tighter confidence intervals. It also has higher agreement (89.1%, see blog post) with the human preference ranking from Chatbot Arena (English-only). We expect this benchmark to be useful for model developers to differentiate their model checkpoints.
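To make the confidence-interval claim concrete, here is a minimal sketch (not the official scoring script) of bootstrapping a percentile interval for a model's win rate from per-question outcomes against the baseline; the outcomes in the example are fabricated purely for illustration.

```python
import random


def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-question outcomes
    (1 = win, 0.5 = tie, 0 = loss against the baseline)."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(outcomes) for _ in outcomes) / len(outcomes)
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Illustrative outcomes for ten questions (not real benchmark data).
print(bootstrap_ci([1, 1, 0.5, 0, 1, 1, 0, 0.5, 1, 1]))
```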
Dataset Card for m-ArenaHard-v2.0
This dataset is used in the paper When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs.
Dataset Details
The m-ArenaHard-v2.0 dataset is a multilingual LLM evaluation set built on the LMArena (formerly LMSYS) arena-hard-auto-v2.0 test dataset. That dataset (containing 750 prompts) was filtered to English-only prompts using the papluca/xlm-roberta-base-language-detection model, resulting… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/m-ArenaHard-v2.0.
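For reference, a minimal sketch of this kind of language filtering with the named model through a standard transformers text-classification pipeline is shown below; the example prompts and the idea of keeping only "en"-labelled items are illustrative, not the exact filtering script used to build the dataset.

```python
from transformers import pipeline

# Language detector named in the filtering step described above.
detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

prompts = [
    "Write a bash script that renames every .txt file in a folder.",
    "Écris un poème sur la mer.",
]

# Keep only prompts the detector labels as English ("en").
english_prompts = [
    p for p, pred in zip(prompts, detector(prompts)) if pred["label"] == "en"
]
print(english_prompts)
```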
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Greek m-ArenaHard
This is a Greek translation of LMArena's arena-hard-auto-v0.1, produced with Claude Sonnet 3.5 v2. This version originates from Cohere's m-ArenaHard, which was originally translated using Google Translate API v3. We curated the dataset further by using Claude Sonnet 3.5 v2 to post-edit the original Google Translate API v3 translations, as we noticed that some translated prompts (especially those related to coding) had not… See the full description on the dataset page: https://huggingface.co/datasets/ilsp/m-ArenaHard_greek.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ru-arena-hard
This is a translated version of the arena-hard-auto dataset for evaluating LLMs. The original dataset was translated manually. In addition, the content of each task was reviewed, and the correctness of the task statement and its compliance with moral and ethical standards were assessed. This dataset therefore allows you to evaluate how well language models support the Russian language.
Overview of the Dataset
Original dataset:… See the full description on the dataset page: https://huggingface.co/datasets/t-tech/ru-arena-hard.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for radm/arenahard_gpt4vsllama3
The dataset was created for fine-tuning Llama-3-70B-Instruct as a judge on Arena Hard (https://github.com/lm-sys/arena-hard-auto).
Dataset Info
question_id: question id from Arena Hard
instruction: original instruction from Arena Hard
model: model whose responses are evaluated against the baseline model (gpt-4-0314) - gpt-4-turbo-2024-04-09 (score: 82.6) and Llama-2-70b-chat-hf (score: 11.6)
input: responses of the evaluated… See the full description on the dataset page: https://huggingface.co/datasets/radm/arenahard_gpt4vsllama3.
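A minimal sketch of turning these fields into supervised examples for judge fine-tuning follows; the split name, the target column, and the prompt template are assumptions for illustration, since the card truncates the full column list.

```python
from datasets import load_dataset

# Split name is an assumption; check the dataset viewer for the real one.
ds = load_dataset("radm/arenahard_gpt4vsllama3", split="train")


def to_judge_example(row):
    # Illustrative prompt template, not the one used to train the judge.
    prompt = (
        f"Instruction:\n{row['instruction']}\n\n"
        f"Responses to compare:\n{row['input']}\n\n"
        "Which response is better?"
    )
    # "output" (the judgment) is an assumed column name.
    return {"prompt": prompt, "target": row.get("output", "")}


judge_examples = ds.map(to_judge_example)
```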
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Sources
Paper: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Link: https://huggingface.co/papers/2502.07346
Repository: https://github.com/CONE-MT/BenchMAX
Dataset Description
BenchMAX_Model-based is a BenchMAX dataset, sourced from m-ArenaHard, that evaluates instruction-following capability via model-based judgment. We extend the original dataset to include languages not supported by m-ArenaHard through… See the full description on the dataset page: https://huggingface.co/datasets/LLaMAX/BenchMAX_Model-based.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This repository contains a range of Arena-Hard-Auto benchmark artifacts sourced as part of the 2024 paper Style Outweighs Substance.
Repository Structure
Model responses for Arena-Hard-Auto questions: data/ArenaHardAuto/model_answer
Our standard reference model for pairwise comparisons was gpt-4-0314.
Our standard set of comparison models was:
Llama-3-8B Variants: bagel-8b-v1.0, Llama-3-8B-Magpie-Align-SFT-v0.2, Llama-3-8B-Magpie-Align-v0.2, Llama-3-8B-Tulu-330K, Llama-3-8B-WildChat… See the full description on the dataset page: https://huggingface.co/datasets/DataShare/sos-artifacts.
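A minimal sketch for loading the stored model answers is given below; it assumes one JSON-lines file per model under data/ArenaHardAuto/model_answer, following the upstream arena-hard-auto layout, so adjust paths and field names after inspecting the actual files.

```python
import json
from pathlib import Path

# Assumed layout: data/ArenaHardAuto/model_answer/<model-name>.jsonl
answers = {}
for path in Path("data/ArenaHardAuto/model_answer").glob("*.jsonl"):
    with path.open() as f:
        answers[path.stem] = [json.loads(line) for line in f]

# Number of stored answers per model.
print({model: len(rows) for model, rows in answers.items()})
```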
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset adds tags to qwopqwop/ko-arena-hard-auto-v0.1, the arena-hard data translated into Korean by qwopqwop.
Tag information
Category | Count | Description
Coding & Debugging | 279 | Users seek help with writing, reviewing, or fixing code in programming.
Planning | 67 | Users need assistance in creating plans or strategies for activities and projects.
Data analysis | 31 | Requests involve interpreting data, statistics, or performing analytical tasks.
Math | 26 | Queries related to mathematical concepts, problems, and… See the full description on the dataset page: https://huggingface.co/datasets/nwirandx/ko-arena-hard-auto-v0.1.
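A minimal sketch of loading the tagged dataset and grouping prompts by category is shown below; the split name and the tag column name ("category") are assumptions, so check the dataset viewer for the actual schema.

```python
from collections import Counter

from datasets import load_dataset

# Split and column names are assumptions for illustration.
ds = load_dataset("nwirandx/ko-arena-hard-auto-v0.1", split="train")

# Count prompts per tag, then keep only the coding questions.
print(Counter(ds["category"]))
coding = ds.filter(lambda row: row["category"] == "Coding & Debugging")
print(len(coding))
```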
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"
Content
This data contains pairwise automatic win-rate evaluations for 2 benchmarks.
Outputs and judge decisions for the m-ArenaHard benchmark for sampled generations (5 each) from Aya Expanse 8B and Qwen2.5 7B Instruct. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals.
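As a rough illustration of how such pairwise judge decisions can be aggregated, the sketch below computes per-prompt win rates for one model over its sampled generations; the record layout and model identifiers are fabricated for illustration and do not reflect the released file schema.

```python
from collections import defaultdict

# Fabricated judge decisions purely for illustration (one entry per
# sampled-generation comparison); inspect the released files for the
# actual schema.
decisions = [
    {"prompt_id": "q1", "winner": "aya-expanse-8b"},
    {"prompt_id": "q1", "winner": "qwen2.5-7b-instruct"},
    {"prompt_id": "q2", "winner": "aya-expanse-8b"},
]

wins = defaultdict(int)
totals = defaultdict(int)
for d in decisions:
    totals[d["prompt_id"]] += 1
    if d["winner"] == "aya-expanse-8b":
        wins[d["prompt_id"]] += 1

win_rate_per_prompt = {q: wins[q] / totals[q] for q in totals}
print(win_rate_per_prompt)
```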