22 datasets found
  1. PPE-GPQA-Best-of-K

    • huggingface.co
    Updated Oct 25, 2024
    Cite
    LMArena (2024). PPE-GPQA-Best-of-K [Dataset]. https://huggingface.co/datasets/lmarena-ai/PPE-GPQA-Best-of-K
    Dataset authored and provided by
    LMArena
    Description

    Overview

    This contains the GPQA correctness preference evaluation set for Preference Proxy Evaluations. The prompts are sampled from GPQA. This dataset is meant for benchmarking and evaluation, not for training.

    License

    User prompts are licensed under CC BY 4.0, and model outputs are governed by the terms of use set by the respective model providers.

    Citation

    @misc{frick2024evaluaterewardmodelsrlhf, title={How to Evaluate Reward Models for… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/PPE-GPQA-Best-of-K.

  2. Math Index by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Math Index by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comparison of Math Index by model. The index represents the average of the math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500).

  3. ko-gpqa

    • huggingface.co
    Updated Jul 22, 2025
    Cite
    davidkim205 (2025). ko-gpqa [Dataset]. https://huggingface.co/datasets/davidkim205/ko-gpqa
    Authors
    davidkim205
    Description

    ko-gpqa

    ko-gpqa is a Korean-translated version of the GPQA (Graduate-Level Google‑Proof Q&A) benchmark dataset, which consists of high-difficulty science questions. Introduced in this paper, GPQA is designed to go beyond simple fact retrieval and instead test an AI system’s ability to perform deep understanding and logical reasoning. It is particularly useful for evaluating true comprehension and inference capabilities in language models. The Korean translation was performed using… See the full description on the dataset page: https://huggingface.co/datasets/davidkim205/ko-gpqa.

  4. Seconds to First Answer Token Received by Model

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Seconds to First Answer Token Received by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of seconds to first answer token received, by model; accounts for reasoning-model 'thinking' time.

  5. GPQA Diamond Large-Model Evaluation Benchmark Leaderboard (GPQA Diamond 大模型评测基准排行榜)

    • datalearner.com
    Updated Jul 20, 2025
    Cite
    数据学习 (DataLearner) (2025). GPQA Diamond 大模型评测基准排行榜 [Dataset]. https://www.datalearner.com/ai-benchmarks/gpqa-diamond
    Dataset authored and provided by
    数据学习 (DataLearner)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

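    An up-to-date leaderboard of large language model (LLM) performance on the GPQA Diamond benchmark, including each model's score, publishing organization, release date, and other data.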

  6. GPQA Large-Model Evaluation Benchmark Leaderboard (GPQA 大模型评测基准排行榜)

    • datalearner.com
    Updated Jul 20, 2025
    Cite
    数据学习 (DataLearner) (2025). GPQA 大模型评测基准排行榜 [Dataset]. https://www.datalearner.com/ai-benchmarks/gpqa
    Dataset authored and provided by
    数据学习 (DataLearner)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

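    An up-to-date leaderboard of large language model (LLM) performance on the GPQA benchmark, including each model's score, publishing organization, release date, and other data.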

  7. Output Speed by Model

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Output Speed by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of output tokens per second, by model; higher is better.

  8. Llama 4 Maverick Output Speed by Provider

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Llama 4 Maverick Output Speed by Provider [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Output Speed: Output Tokens per Second by Provider

  9. Pricing: Input and Output by Model

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Pricing: Input and Output by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Price: USD per 1M Tokens by Model

  10. Seconds to Output 500 Tokens, including reasoning model 'thinking' time by...

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Seconds to Output 500 Tokens, including reasoning model 'thinking' time by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of seconds to output 500 tokens, including reasoning-model 'thinking' time, by model; lower is better.

  11. Intelligence vs. Output Speed by Model

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Output Speed by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comprehensive comparison of Artificial Analysis Intelligence Index vs. Output Speed (Output Tokens per Second) by Model

  12. Intelligence vs. Price by Model

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Price by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comprehensive comparison of Artificial Analysis Intelligence Index vs. Price (USD per M Tokens, Log Scale, More Expensive to Cheaper) by Model

  13. Llama 4 Maverick (FP8) Pricing: Input and Output by Provider

    • artificialanalysis.ai
    Updated Jun 30, 2025
    Cite
    Artificial Analysis (2025). Llama 4 Maverick (FP8) Pricing: Input and Output by Provider [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of price (USD per 1M tokens), by provider; lower is better.

  14. Pricing: Image Input Pricing by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Pricing: Image Input Pricing by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comparison of Image Input Price: USD per 1k images at 1MP (1024x1024) by Model

  15. End-to-End Response Time by Input Token Count by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). End-to-End Response Time by Input Token Count by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comparison of seconds to output 500 tokens, including reasoning-model 'thinking' time, by model; lower is better.

  16. Intelligence vs. Intelligence by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Intelligence by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comprehensive comparison of Artificial Analysis Intelligence Index vs. Output Tokens Used in Artificial Analysis Intelligence Index (Log Scale) by Model

  17. Intelligence vs. Output Speed by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Output Speed by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comprehensive comparison of Artificial Analysis Intelligence Index vs. Output Speed (Output Tokens per Second) by Model

  18. Intelligence vs. Context Window by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Context Window by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comprehensive comparison of Artificial Analysis Intelligence Index vs. Context Window (Tokens) by Model

  19. Intelligence Index by Grok Endpoint

    • artificialanalysis.ai
    Updated Jul 23, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by Grok Endpoint [Dataset]. https://artificialanalysis.ai/models/grok-4
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comparison of Artificial Analysis Intelligence Index by model. The index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  20. Output Speed vs. Price by Models Model

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Output Speed vs. Price by Models Model [Dataset]. https://artificialanalysis.ai/models
    Dataset provided by
    Authors
    Artificial Analysis
    Description

    Comprehensive comparison of Output Speed (Output Tokens per Second) vs. Price (USD per M Tokens) by Model
