41 datasets found
  1. GPQA Dataset

    • paperswithcode.com
    Updated Apr 13, 2024
    Cite
    David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman (2024). GPQA Dataset [Dataset]. https://paperswithcode.com/dataset/gpqa
    Dataset updated
    Apr 13, 2024
    Authors
    David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman
    Description

    GPQA stands for Graduate-Level Google-Proof Q&A Benchmark. It is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms.

    Composition: GPQA consists of 448 multiple-choice questions meticulously crafted by domain experts in biology, physics, and chemistry. These questions are intentionally designed to be high-quality and extremely difficult.
    Expert accuracy: Even experts who hold or are pursuing PhDs in the corresponding domains achieve only 65% accuracy on these questions (or 74% when excluding clear mistakes identified in retrospect).
    Google-proof: The questions are "Google-proof," meaning that even with unrestricted access to the web, highly skilled non-expert validators reach an accuracy of only 34% despite spending over 30 minutes searching for answers.
    AI systems: State-of-the-art AI systems, including the strongest GPT-4-based baseline reported by the authors, achieve only 39% accuracy on this challenging dataset.

    The difficulty of GPQA for both skilled non-experts and cutting-edge AI systems makes it an excellent resource for conducting realistic scalable oversight experiments. These experiments aim to explore ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities [1][3].

    In summary, GPQA serves as a valuable benchmark for assessing the robustness and limitations of language models, especially when faced with complex and nuanced questions. Its difficulty level encourages research into effective oversight methods, bridging the gap between AI and human expertise.

    (1) [2311.12022] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv. https://arxiv.org/abs/2311.12022
    (2) GPQA: A Graduate-Level Google-Proof Q&A Benchmark. Klu. https://klu.ai/glossary/gpqa-eval
    (3) GPA Dataset (Spring 2010 through Spring 2020). Data Science Discovery. https://discovery.cs.illinois.edu/dataset/gpa/
    (4) GPQA: A Graduate-Level Google-Proof Q&A Benchmark. GitHub. https://github.com/idavidrein/gpqa
    (5) Data Sets. OpenIntro. https://www.openintro.org/data/index.php?data=satgpa
    (6) DOI: https://doi.org/10.48550/arXiv.2311.12022

  2. gpqa

    • huggingface.co
    • opendatalab.com
    Updated Nov 21, 2023
    + more versions
    Cite
    David Rein (2023). gpqa [Dataset]. https://huggingface.co/datasets/Idavidrein/gpqa
    Dataset updated
    Nov 21, 2023
    Authors
    David Rein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for GPQA

    GPQA is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
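
    A minimal loading sketch with the Hugging Face datasets library is shown below. The config name "gpqa_main" and the column names are assumptions based on common GPQA usage; the dataset is gated, so access terms must be accepted and a Hugging Face login provided. Verify the details on the dataset page.

```python
# Sketch: load GPQA from the Hugging Face Hub. Assumes gated access has already been
# granted (e.g., via `huggingface-cli login`); config and column names are assumptions.
from datasets import load_dataset

gpqa = load_dataset("Idavidrein/gpqa", "gpqa_main")  # other assumed configs: gpqa_diamond, gpqa_extended
example = gpqa["train"][0]
print(example["Question"][:200])     # assumed column name
print(example["Correct Answer"])     # assumed column name
```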

  3. Gpqa by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Gpqa by Model [Dataset]. https://artificialanalysis.ai/evaluations/gpqa-diamond
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of GPQA Diamond scores by model, independently evaluated by Artificial Analysis.

  4. PPE-GPQA-Best-of-K

    • huggingface.co
    Updated Oct 25, 2024
    Cite
    PPE-GPQA-Best-of-K [Dataset]. https://huggingface.co/datasets/lmarena-ai/PPE-GPQA-Best-of-K
    Dataset updated
    Oct 25, 2024
    Dataset authored and provided by
    LMArena
    Description

    Overview

    This contains the GPQA correctness preference evaluation set for Preference Proxy Evaluations (PPE). The prompts are sampled from GPQA. This dataset is meant for benchmarking and evaluation, not for training. The paper and code are linked from the dataset page.
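
    As a rough illustration of how a best-of-K correctness-preference set like this can be scored, the sketch below picks the highest reward-model-scored response per prompt and checks whether it is correct. The field names are hypothetical, not this dataset's actual schema.

```python
# Hypothetical best-of-K scoring sketch; per-prompt field names are illustrative only.
from typing import Dict, List

def best_of_k_accuracy(rows: List[Dict]) -> float:
    """For each prompt, pick the response with the highest reward score and check correctness."""
    hits = 0
    for row in rows:
        best_idx = max(range(len(row["reward_scores"])), key=row["reward_scores"].__getitem__)
        hits += int(row["is_correct"][best_idx])
    return hits / len(rows)

rows = [
    {"reward_scores": [0.2, 0.9, 0.4], "is_correct": [False, True, False]},
    {"reward_scores": [0.8, 0.1, 0.3], "is_correct": [False, False, True]},
]
print(best_of_k_accuracy(rows))  # 0.5 for this toy input
```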

      License
    

    User prompts are licensed under CC BY 4.0, and model outputs are governed by the terms of use set by the respective model providers.

      Citation
    

    @misc{frick2024evaluaterewardmodelsrlhf, title={How to Evaluate Reward Models for… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/PPE-GPQA-Best-of-K.

  5. gpqa

    • huggingface.co
    Updated Jun 2, 2025
    Cite
    Casimir Nuesperling (2025). gpqa [Dataset]. https://huggingface.co/datasets/casimiir/gpqa
    Dataset updated
    Jun 2, 2025
    Authors
    Casimir Nuesperling
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a reformatted version of the original GPQA dataset from Idavidrein/gpqa. It includes only the main question, four shuffled answer choices, the correct answer index, the subdomain, and a unique id for each entry. Please cite the GPQA paper if you use this data: GPQA: A Graduate-Level Google-Proof Q&A Benchmark.
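
    Given that layout (question, four shuffled choices, correct answer index, subdomain, id), a simple accuracy computation might look like the sketch below; the exact column names on the Hub may differ.

```python
# Illustrative accuracy computation over records shaped like this reformatted GPQA variant.
# Column names ("choices", "answer_index", ...) are assumptions; check the dataset viewer.
def accuracy(records, predicted_indices):
    correct = sum(int(p == r["answer_index"]) for r, p in zip(records, predicted_indices))
    return correct / len(records)

records = [{
    "question": "Which particle mediates the strong interaction?",
    "choices": ["photon", "gluon", "W boson", "graviton"],
    "answer_index": 1,
    "subdomain": "physics",
    "id": "example-1",
}]
print(accuracy(records, [1]))  # 1.0
```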

  6. GPQA_with_Llama_3.1_70B_Instruct_v1

    • huggingface.co
    Updated Jun 24, 2025
    + more versions
    Cite
    HazyResearch (2025). GPQA_with_Llama_3.1_70B_Instruct_v1 [Dataset]. https://huggingface.co/datasets/hazyresearch/GPQA_with_Llama_3.1_70B_Instruct_v1
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    HazyResearch
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GPQA with Llama-3.1-70B-Instruct

    This dataset contains 646 graduate-level science questions from the GPQA benchmark with 100 candidate responses generated by Llama-3.1-70B-Instruct for each problem. Each response has been evaluated for correctness using a mixture of GPT-4o-mini and procedural Python code to robustly parse different answer formats, and scored by multiple reward models (scalar values) and LM judges (boolean verdicts).
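
    The sketch below contrasts coverage (whether any of the candidate responses is correct) with the accuracy of selecting one response via a reward-model score, which is the kind of analysis this layout supports. The field names are hypothetical, not the published schema.

```python
# Hypothetical analysis over a questions-with-candidate-responses layout like this one:
# compare "any candidate correct" coverage with accuracy when picking by reward score.
def coverage_and_selection(questions):
    coverage = sum(any(q["correct"]) for q in questions) / len(questions)
    selected = sum(
        q["correct"][max(range(len(q["reward"])), key=q["reward"].__getitem__)]
        for q in questions
    ) / len(questions)
    return coverage, selected

qs = [
    {"correct": [False, True, False], "reward": [0.1, 0.7, 0.2]},
    {"correct": [False, False, False], "reward": [0.5, 0.4, 0.3]},
]
print(coverage_and_selection(qs))  # (0.5, 0.5) for this toy input
```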

      Dataset Structure
    

    Split: Single… See the full description on the dataset page: https://huggingface.co/datasets/hazyresearch/GPQA_with_Llama_3.1_70B_Instruct_v1.

  7. Intelligence Index by GPT-4o Endpoint

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by GPT-4o Endpoint [Dataset]. https://artificialanalysis.ai/models/gpt-4o
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison, by model, of the Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.
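
    For intuition only, the sketch below combines seven evaluation scores with an equal-weighted mean. Artificial Analysis's actual normalisation and weighting are not described in this listing, so the equal weighting is an assumption and the numbers are placeholders.

```python
# Equal-weighted combination of seven benchmark scores into one index (illustrative only;
# not Artificial Analysis's published methodology, and the scores below are placeholders).
scores = {
    "MMLU-Pro": 74.0, "GPQA Diamond": 51.0, "Humanity's Last Exam": 5.0,
    "LiveCodeBench": 33.0, "SciCode": 32.0, "AIME": 13.0, "MATH-500": 79.0,
}
index = sum(scores.values()) / len(scores)
print(f"Equal-weighted index: {index:.1f}")  # 41.0 with these placeholder scores
```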

  8. Intelligence Index by Model

    • artificialanalysis.ai
    Updated Jun 25, 2025
    + more versions
    Cite
    Artificial Analysis (2025). Intelligence Index by Models Model [Dataset]. https://artificialanalysis.ai/models/open-source
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison, by model, of the Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  9. GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    HazyResearch (2025). GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1 [Dataset]. https://huggingface.co/datasets/hazyresearch/GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    HazyResearch
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GPQA Diamond with Llama-3.1-70B-Instruct (up to 1K Samples)

    This dataset contains 198 graduate-level science questions from the GPQA Diamond benchmark with up to 1000 candidate responses generated by Llama-3.1-70B-Instruct for each problem. Each response has been evaluated for correctness using a mixture of GPT-4o-mini and procedural Python code to robustly parse different answer formats, and scored by multiple reward models (scalar values) and LM judges (boolean verdicts). For more… See the full description on the dataset page: https://huggingface.co/datasets/hazyresearch/GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1.
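
    One natural use of up to 1000 candidate responses per question is estimating pass@k. The sketch below uses the standard unbiased pass@k estimator; applying it to this dataset's correctness labels is a suggestion, not something the dataset card prescribes.

```python
# Standard unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
# where n = samples drawn, c = correct samples among them, k = attempt budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 150 of 1000 sampled responses are correct; estimate success with 10 attempts.
print(round(pass_at_k(n=1000, c=150, k=10), 3))
```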

  10. Math Index by GPT-4o Endpoint

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Math Index by GPT-4o Endpoint [Dataset]. https://artificialanalysis.ai/models/gpt-4o
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison, by model, of the Math Index, which represents the average of the math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 and MATH-500).

  11. Coding Index by GPT-4o Endpoint

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Coding Index by GPT-4o Endpoint [Dataset]. https://artificialanalysis.ai/models/gpt-4o
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison, by model, of the Coding Index, which represents the average of the coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench and SciCode).

  12. Llama 4 Maverick (FP8) Pricing: Input and Output by Provider

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Llama 4 Maverick (FP8) Pricing: Input and Output by Provider [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of price (USD per 1M tokens; lower is better) by provider.
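
    For context on the "USD per 1M tokens" unit, here is a small worked example of per-request cost; the prices are placeholders, not any provider's actual rates.

```python
# Worked example: what per-1M-token pricing implies for one request (placeholder prices).
input_price_per_m = 0.20    # USD per 1M input tokens (placeholder)
output_price_per_m = 0.60   # USD per 1M output tokens (placeholder)

input_tokens, output_tokens = 8_000, 1_000
cost = input_tokens / 1e6 * input_price_per_m + output_tokens / 1e6 * output_price_per_m
print(f"${cost:.4f} per request")  # $0.0022 with these placeholder numbers
```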

  13. DATA-AI_full-reasoning

    • huggingface.co
    Updated Jun 3, 2025
    + more versions
    Cite
    M.INC. (2025). DATA-AI_full-reasoning [Dataset]. https://huggingface.co/datasets/Mattimax/DATA-AI_full-reasoning
    Dataset updated
    Jun 3, 2025
    Authors
    M.INC.
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    NaturalReasoning is a large-scale dataset for general reasoning tasks. It consists of high-quality challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The questions have been deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, MMLU-STEM. For each question, we extract the reference final answer from the original document from the pretraining corpora if possible. We also provide a model-generated response from… See the full description on the dataset page: https://huggingface.co/datasets/Mattimax/DATA-AI_full-reasoning.
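
    Benchmark decontamination of this kind is often done by n-gram overlap; the sketch below shows that generic technique, not the specific pipeline used for this dataset.

```python
# Generic n-gram-overlap decontamination sketch (illustrative; not this dataset's exact pipeline).
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(candidate: str, benchmark_questions: list, n: int = 8) -> bool:
    cand = ngrams(candidate, n)
    return any(cand & ngrams(q, n) for q in benchmark_questions)

benchmark = ["What is the ground-state energy of a particle in an infinite square well of width L?"]
print(is_contaminated("Compute the ground-state energy of a particle in an infinite square well of width L.", benchmark))  # True
```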

  14. Output Speed by Model

    • artificialanalysis.ai
    + more versions
    Cite
    Artificial Analysis, Output Speed by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of output speed (output tokens per second; higher is better) by model.

  15. Pricing: Input and Output by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Pricing: Input and Output by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of price (USD per 1M tokens) by model.

  16. Seconds to Output 500 Tokens, including reasoning model 'thinking' time by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Seconds to Output 500 Tokens, including reasoning model 'thinking' time by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of seconds to output 500 tokens, including reasoning model 'thinking' time (lower is better), by model.
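
    One plausible way to reason about this metric is to decompose it into time to the first answer token (which absorbs any reasoning-model "thinking") plus streaming time for the remaining tokens. The exact measurement methodology is not stated in this listing, and the numbers below are placeholders.

```python
# Back-of-the-envelope decomposition (assumption, not the published methodology):
# total ≈ time to first answer token + remaining tokens / streaming speed.
time_to_first_answer_token_s = 12.0   # placeholder; includes reasoning "thinking" time
output_tokens_per_second = 80.0       # placeholder streaming speed

seconds_to_500_tokens = time_to_first_answer_token_s + (500 - 1) / output_tokens_per_second
print(f"{seconds_to_500_tokens:.1f} s")  # 18.2 s with these placeholders
```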

  17. ReasonSet

    • huggingface.co
    Cite
    Toby Simonds, ReasonSet [Dataset]. https://huggingface.co/datasets/TamasSimonds/ReasonSet
    Authors
    Toby Simonds
    Description

    ReasonSet Dataset

    This dataset is sourced from the paper "REL: Working Out Is All You Need".

      Dataset Description
    

    ReasonSet is a dataset of problems and their worked solutions, specifically designed to help improve models' reasoning abilities. Questions are sourced from AIME, GPQA, MATH, and some hand-created ones.

    Question: the question
    Working out: in-depth solution with reasoning steps
    provided_solution: the solution provided by the benchmark
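
    A hypothetical record using these fields, and one way to format it into a training prompt, is sketched below; the actual column names on the Hub may differ.

```python
# Hypothetical ReasonSet-style record (field names assumed from the description above).
record = {
    "Question": "How many positive integers less than 100 are divisible by 6 but not by 4?",
    "Working out": "There are 16 multiples of 6 below 100; 8 of them are also multiples of 12, leaving 8.",
    "provided_solution": "8",
}

prompt = (
    f"Problem: {record['Question']}\n\n"
    f"Worked solution:\n{record['Working out']}\n\n"
    f"Answer: {record['provided_solution']}"
)
print(prompt)
```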

      Citation
    

    If you use… See the full description on the dataset page: https://huggingface.co/datasets/TamasSimonds/ReasonSet.

  18. Llama 4 Maverick Output Speed by Provider

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Llama 4 Maverick (Turbo, FP8) Output Speed by Provider [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of output speed (output tokens per second) by provider.

  19. Seconds to First Answer Token Received by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Seconds to First Answer Token Received by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of seconds to first answer token received (accounts for reasoning model 'thinking' time) by model.

  20. Intelligence vs. Output Speed by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Intelligence vs. Output Speed by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Artificial Analysis Intelligence Index vs. output speed (output tokens per second) by model.
