Overview
This dataset contains the GPQA correctness preference evaluation set for Preference Proxy Evaluations (PPE). Prompts are sampled from GPQA. The dataset is intended for benchmarking and evaluation, not for training. Paper | Code
License
User prompts are licensed under CC BY 4.0, and model outputs are governed by the terms of use set by the respective model providers.
Citation
@misc{frick2024evaluaterewardmodelsrlhf, title={How to Evaluate Reward Models for… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/PPE-GPQA-Best-of-K.
ko-gpqa
ko-gpqa is a Korean translation of GPQA (Graduate-Level Google-Proof Q&A), a benchmark of high-difficulty science questions. Introduced in this paper, GPQA is designed to go beyond simple fact retrieval and instead test an AI system's capacity for deep understanding and logical reasoning, making it particularly useful for evaluating genuine comprehension and inference in language models. The Korean translation was performed using… See the full description on the dataset page: https://huggingface.co/datasets/davidkim205/ko-gpqa.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
An up-to-date leaderboard of large language model (LLM) performance on the GPQA Diamond benchmark, including each model's score, releasing organization, and release date.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
An up-to-date leaderboard of large language model (LLM) performance on the GPQA benchmark, including each model's score, releasing organization, and release date.