13 datasets found
  1. m-ArenaHard

    • huggingface.co
    Updated Mar 3, 2025
    Cite
    Cohere Labs (2025). m-ArenaHard [Dataset]. https://huggingface.co/datasets/CohereLabs/m-ArenaHard
    Explore at:
    13 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Cohere Labs
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for m-ArenaHard

      Dataset Details
    

    The m-ArenaHard dataset is a multilingual LLM evaluation set. It was created by translating the prompts of the originally English-only LMArena (formerly LMSYS) arena-hard-auto-v0.1 test dataset into 22 languages using Google Translate API v3. The original English-only prompts were created by Li et al. (2024) and consist of 500 challenging user queries sourced from Chatbot Arena. The authors show that these can be used… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/m-ArenaHard.
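
    A minimal sketch of loading one language subset of m-ArenaHard with the Hugging Face datasets library. The config name ("deu_Latn") and split name ("test") are assumptions for illustration; check the dataset page for the exact identifiers of the 22 languages.

    ```python
    from datasets import load_dataset

    # Load one translated language subset of m-ArenaHard.
    # Config and split names are assumptions -- see the dataset page.
    ds = load_dataset("CohereLabs/m-ArenaHard", "deu_Latn", split="test")

    print(len(ds))   # expected: 500 prompts, mirroring the English original
    print(ds[0])     # one translated Arena-Hard user query
    ```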

  2. Arena-Hard-Auto Dataset

    • paperswithcode.com
    Updated Jun 24, 2025
    Cite
    Tianle Li; Wei-Lin Chiang; Evan Frick; Lisa Dunlap; Tianhao Wu; Banghua Zhu; Joseph E. Gonzalez; Ion Stoica (2025). Arena-Hard-Auto Dataset [Dataset]. https://paperswithcode.com/dataset/arena-hard-auto
    Explore at:
    Dataset updated
    Jun 24, 2025
    Authors
    Tianle Li; Wei-Lin Chiang; Evan Frick; Lisa Dunlap; Tianhao Wu; Banghua Zhu; Joseph E. Gonzalez; Ion Stoica
    Description

    The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned Large Language Models (LLMs)¹. It was developed to provide a cheaper and faster approximation to human preference¹.

    Here are some key features of the Arena-Hard-Auto benchmark:
    - It contains 500 challenging user queries¹.
    - It uses GPT-4-Turbo as a judge to compare the models' responses against a baseline model (default: GPT-4-0314)¹.
    - It employs an automatic judge as a cheaper and faster approximator to human preference¹.
    - It has the highest correlation and separability to Chatbot Arena among popular open-ended LLM benchmarks¹.
    - If you are curious to see how well your model might perform on Chatbot Arena, Arena-Hard-Auto is recommended¹.

    The benchmark is built from live data in Chatbot Arena, which is a popular crowd-sourced platform for LLM evaluations⁴. It offers significantly stronger separability against other benchmarks with tighter confidence intervals².
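
    A minimal sketch, not the official arena-hard-auto code, of the pairwise judging protocol described above: a judge model compares a candidate answer against the baseline's answer to the same query. The OpenAI client usage and prompt wording are illustrative assumptions.

    ```python
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge(query: str, baseline_answer: str, candidate_answer: str) -> str:
        """Ask the judge model which of two answers better serves the query."""
        prompt = (
            f"User query:\n{query}\n\n"
            f"Assistant A (baseline):\n{baseline_answer}\n\n"
            f"Assistant B (candidate):\n{candidate_answer}\n\n"
            "Which assistant answered better? Reply with exactly 'A' or 'B'."
        )
        resp = client.chat.completions.create(
            model="gpt-4-turbo",  # the default judge named in the description
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip()
    ```

    The real pipeline is more elaborate: it judges each pair twice with positions swapped to control for position bias and uses a finer-grained verdict scale, but the core comparison has this shape.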

    (1) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM .... https://github.com/lm-sys/arena-hard-auto.
    (2) Arena Hard – UC Berkeley Sky Computing. https://sky.cs.berkeley.edu/project/arena-hard/.
    (3) From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline .... https://lmsys.org/blog/2024-04-19-arena-hard/.
    (4) GitHub - lm-sys/arena-hard-auto: Arena-Hard-Auto: An automatic LLM .... https://github.com/lm-sys/arena-hard-auto.
    (5) Arena-Hard: An Open-Source High-Quality Large-Model Evaluation Benchmark - CSDN Blog. https://blog.csdn.net/weixin_57291105/article/details/138132998.

  3. arena-hard-auto

    • huggingface.co
    Updated Apr 24, 2025
    Cite
    LMArena (2025). arena-hard-auto [Dataset]. https://huggingface.co/datasets/lmarena-ai/arena-hard-auto
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    LMArena
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Arena-Hard-Auto

    Repo for storing pre-generated model answers and judgments for:

    - Arena-Hard-v0.1
    - Arena-Hard-v2.0-Preview

    Repo: https://github.com/lmarena/arena-hard-auto
    Paper: https://arxiv.org/abs/2406.11939

      Citation
    

    The code in this repository is developed from the papers below. Please cite it if you find the repository helpful. @article{li2024crowdsourced, title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline}… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-hard-auto.

  4. ko-arena-hard-auto-v0.1

    • huggingface.co
    Updated Dec 10, 2024
    Cite
    Junjae Lee (2024). ko-arena-hard-auto-v0.1 [Dataset]. https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1
    Explore at:
    Dataset updated
    Dec 10, 2024
    Authors
    Junjae Lee
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Ko-Arena-Hard-Auto

    Korean / English | Leaderboard / Code. ko-arena-hard-auto-v0.1 is the question dataset for an automatic evaluation tool for benchmarking Korean. It was built by translating arena-hard-auto-v0.1, a benchmark dataset with high correlation with and separability on human preference, into Korean using GPT-4o and o1, followed by manual review. For further details and benchmarking results, see the ko-arena-hard-auto code; if you are interested in the original benchmark, see the arena-hard-auto code. The format of some questions was changed because the original format was hard to preserve (indices 1, 28, 29). Some question formats were changed to elicit answers in Korean; originally, these consisted only of code (indices 30, 379, 190). References: @article{li2024crowdsourced, title={From Crowdsourced… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ko-arena-hard-auto-v0.1.

  5. Arena-Hard-v0.1

    • kaggle.com
    Updated May 1, 2024
    Cite
    LMSYS ORG (2024). Arena-Hard-v0.1 [Dataset]. http://doi.org/10.34740/kaggle/dsv/8283907
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    LMSYS ORG
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Check out our blog post

    Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should: (1) robustly separate model capability; (2) reflect human preference in real-world use cases; and (3) update frequently to avoid over-fitting or test-set leakage.

    Traditional benchmarks are often static or closed-ended (e.g., MMLU multiple-choice QA) and do not satisfy these requirements. Meanwhile, models are evolving faster than ever, underscoring the need for benchmarks with high separability.

    We introduce Arena-Hard – a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals.

    We compare our new benchmark, Arena-Hard v0.1, to a current leading chat LLM benchmark, MT-Bench. We show that Arena-Hard v0.1 offers significantly stronger separability than MT-Bench, with tighter confidence intervals. It also has higher agreement (89.1%, see the blog post) with the human preference ranking from Chatbot Arena (English-only). We expect this benchmark to be useful for model developers in differentiating their model checkpoints.
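
    To make "separability with tighter confidence intervals" concrete, here is a small illustration (with made-up judgments, not Arena-Hard data) of bootstrapping a win-rate confidence interval; two models are separable when their intervals do not overlap.

    ```python
    import random

    def bootstrap_win_rate_ci(wins, n_boot=1000, alpha=0.05):
        """wins: list of 1/0 outcomes (candidate beat baseline on each query)."""
        stats = sorted(
            sum(random.choices(wins, k=len(wins))) / len(wins)
            for _ in range(n_boot)
        )
        return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2))]

    # Hypothetical per-question judgments over 500 queries.
    wins = [int(random.random() < 0.6) for _ in range(500)]
    print(bootstrap_win_rate_ci(wins))  # e.g. (0.56, 0.64)
    ```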

  6. m-ArenaHard-v2.0

    • huggingface.co
    Updated Jul 31, 2024
    Cite
    Cohere Labs (2024). m-ArenaHard-v2.0 [Dataset]. https://huggingface.co/datasets/CohereLabs/m-ArenaHard-v2.0
    Explore at:
    Dataset updated
    Jul 31, 2024
    Dataset authored and provided by
    Cohere Labs
    Description

    Dataset Card for m-ArenaHard-v2.0

    This dataset is used in the paper When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs.

      Dataset Details
    

    The m-ArenaHard-v2.0 dataset is a multilingual LLM evaluation set. It is built on the LMArena (formerly LMSYS) arena-hard-auto-v2.0 test dataset. That dataset (containing 750 prompts) was filtered to English-only prompts using the papluca/xlm-roberta-base-language-detection model, resulting… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/m-ArenaHard-v2.0.
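
    A sketch of the kind of language filtering described above, using the named papluca/xlm-roberta-base-language-detection model through the transformers pipeline. This is an assumed reconstruction for illustration, not Cohere's actual filtering script.

    ```python
    from transformers import pipeline

    # Language detector named in the description; it emits ISO language
    # labels such as "en" with a confidence score.
    detector = pipeline(
        "text-classification",
        model="papluca/xlm-roberta-base-language-detection",
    )

    prompts = ["How do I reverse a linked list?", "¿Cómo estás hoy?"]
    english_only = [
        p for p in prompts if detector(p, truncation=True)[0]["label"] == "en"
    ]
    print(english_only)  # keeps only prompts detected as English
    ```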

  7. m-ArenaHard_greek

    • huggingface.co
    Updated Jun 18, 2025
    Cite
    Institute for Language and Speech Processing (2025). m-ArenaHard_greek [Dataset]. https://huggingface.co/datasets/ilsp/m-ArenaHard_greek
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Institute for Language and Speech Processing (http://www.ilsp.gr/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Greek m-ArenaHard

    This is a Greek translation of LMArena's arena-hard-auto-v0.1, produced with Claude Sonnet 3.5 v2. This particular version originates from Cohere's m-ArenaHard, which was originally translated using Google Translate API v3. We curated the dataset further by using Claude Sonnet 3.5 v2 to post-edit the original Google Translate API v3 translations, as we noticed that some translated prompts (especially those related to coding) had not… See the full description on the dataset page: https://huggingface.co/datasets/ilsp/m-ArenaHard_greek.

  8. ru-arena-hard

    • huggingface.co
    Updated Jul 18, 2025
    Cite
    T-Tech (2025). ru-arena-hard [Dataset]. https://huggingface.co/datasets/t-tech/ru-arena-hard
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 18, 2025
    Dataset authored and provided by
    T-Tech
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ru-arena-hard

    This is a translated version of the arena-hard-auto dataset for evaluating LLMs. The translation of the original dataset was done manually. In addition, the content of each task was reviewed: the correctness of the task statement and its compliance with moral and ethical standards were assessed. This dataset therefore allows you to evaluate language models' ability to handle the Russian language.

      Overview of the Dataset
    

    Original dataset:… See the full description on the dataset page: https://huggingface.co/datasets/t-tech/ru-arena-hard.

  9. arenahard_gpt4vsllama3

    • huggingface.co
    Updated Jun 6, 2024
    Cite
    r4dm (2024). arenahard_gpt4vsllama3 [Dataset]. https://huggingface.co/datasets/radm/arenahard_gpt4vsllama3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2024
    Authors
    r4dm
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for radm/arenahard_gpt4vsllama3

    The dataset was created for fine-tuning Llama-3-70B-Instruct as a judge on Arena Hard (https://github.com/lm-sys/arena-hard-auto)

      Dataset Info
    

    question_id: question ID from Arena Hard
    instruction: original instruction from Arena Hard
    model: model whose responses are evaluated against the baseline model (gpt-4-0314) - gpt-4-turbo-2024-04-09 (score: 82.6) and Llama-2-70b-chat-hf (score: 11.6)
    input: responses of the evaluated… See the full description on the dataset page: https://huggingface.co/datasets/radm/arenahard_gpt4vsllama3.
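
    A minimal sketch of loading this dataset and inspecting the fields listed above with the Hugging Face datasets library; the split name is an assumption.

    ```python
    from datasets import load_dataset

    ds = load_dataset("radm/arenahard_gpt4vsllama3", split="train")  # split assumed

    row = ds[0]
    print(row["question_id"], row["model"])  # fields documented above
    print(row["instruction"][:200])          # original Arena Hard instruction
    ```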

  10. BenchMAX_Model-based

    • huggingface.co
    Updated Feb 11, 2025
    Cite
    LLaMAX (2025). BenchMAX_Model-based [Dataset]. https://huggingface.co/datasets/LLaMAX/BenchMAX_Model-based
    Explore at:
    Dataset updated
    Feb 11, 2025
    Authors
    LLaMAX
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Sources

    Paper: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
    Link: https://huggingface.co/papers/2502.07346
    Repository: https://github.com/CONE-MT/BenchMAX

      Dataset Description
    

    BenchMAX_Model-based is a dataset in the BenchMAX suite, sourced from m-ArenaHard, which evaluates instruction-following capability via model-based judgment. We extend the original dataset to include languages that are not supported by m-ArenaHard through… See the full description on the dataset page: https://huggingface.co/datasets/LLaMAX/BenchMAX_Model-based.

  11. sos-artifacts

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Cite
    DataShare (2025). sos-artifacts [Dataset]. https://huggingface.co/datasets/DataShare/sos-artifacts
    Explore at:
    Dataset updated
    Apr 26, 2025
    Dataset authored and provided by
    DataShare
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This repository contains a range of Arena-Hard-Auto benchmark artifacts sourced as part of the 2024 paper Style Outweighs Substance.

      Repository Structure

    Model responses for Arena-Hard-Auto questions: data/ArenaHardAuto/model_answer

    Our standard reference model for pairwise comparisons was gpt-4-0314.

    Our standard set of comparison models was:

    Llama-3-8B Variants: bagel-8b-v1.0, Llama-3-8B-Magpie-Align-SFT-v0.2, Llama-3-8B-Magpie-Align-v0.2, Llama-3-8B-Tulu-330K, Llama-3-8B-WildChat… See the full description on the dataset page: https://huggingface.co/datasets/DataShare/sos-artifacts.
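
    A sketch of browsing the repository layout described above with huggingface_hub; everything beyond the data/ArenaHardAuto/model_answer path named in the description is an assumption.

    ```python
    from huggingface_hub import list_repo_files

    files = list_repo_files("DataShare/sos-artifacts", repo_type="dataset")

    # Pre-generated model answers for the Arena-Hard-Auto questions.
    answers = [f for f in files if f.startswith("data/ArenaHardAuto/model_answer")]
    print(answers[:10])
    ```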

  12. ko-arena-hard-auto-v0.1

    • huggingface.co
    Updated Feb 12, 2025
    Cite
    Park (2025). ko-arena-hard-auto-v0.1 [Dataset]. https://huggingface.co/datasets/nwirandx/ko-arena-hard-auto-v0.1
    Explore at:
    Dataset updated
    Feb 12, 2025
    Authors
    Park
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset adds tags to qwopqwop/ko-arena-hard-auto-v0.1, the Korean arena-hard translation produced by qwopqwop.

      Tag information
    

    Category            Count  Description
    Coding & Debugging    279  Users seek help with writing, reviewing, or fixing code in programming.
    Planning               67  Users need assistance in creating plans or strategies for activities and projects.
    Data analysis          31  Requests involve interpreting data, statistics, or performing analytical tasks.
    Math                   26  Queries related to mathematical concepts, problems, and…

    See the full description on the dataset page: https://huggingface.co/datasets/nwirandx/ko-arena-hard-auto-v0.1.
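
    A toy check that would reproduce category counts like those in the table above, assuming the tags live in a "category" column; the actual column and split names may differ, so treat both as assumptions.

    ```python
    from collections import Counter
    from datasets import load_dataset

    # Column and split names are assumptions -- see the dataset page.
    ds = load_dataset("nwirandx/ko-arena-hard-auto-v0.1", split="train")
    print(Counter(ds["category"]))  # e.g. Counter({"Coding & Debugging": 279, ...})
    ```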

  13. deja-vu-pairwise-evals

    • huggingface.co
    Updated Apr 17, 2025
    Cite
    Cohere Labs (2025). deja-vu-pairwise-evals [Dataset]. https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Cohere Labs
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"

      Content
    

    This dataset contains pairwise automatic win-rate evaluations for two benchmarks:

    - Outputs and judge decisions for the m-ArenaHard benchmark, for sampled generations (5 each) from Aya Expanse 8B and Qwen2.5 7B Instruct.
    - Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/deja-vu-pairwise-evals.
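
    A toy sketch (with a made-up record layout, not this dataset's actual schema) of turning per-sample pairwise judge decisions, like those stored here, into a model-level win rate.

    ```python
    from collections import defaultdict

    # Hypothetical judge decisions: 5 sampled generations per model, as in
    # the description; the field names are illustrative only.
    judgments = (
        [{"model": "aya-expanse-8b", "win": i % 2 == 0} for i in range(5)]
        + [{"model": "qwen2.5-7b-instruct", "win": i % 3 == 0} for i in range(5)]
    )

    totals = defaultdict(lambda: [0, 0])  # model -> [wins, games]
    for j in judgments:
        totals[j["model"]][0] += int(j["win"])
        totals[j["model"]][1] += 1

    for model, (w, n) in totals.items():
        print(f"{model}: win rate {w / n:.2f}")
    ```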
