15 datasets found

leaderboard-documents-gpqa
huggingface.co
Updated Feb 6, 2025
Cite
Jerome White (2025). leaderboard-documents-gpqa [Dataset]. https://huggingface.co/datasets/jerome-white/leaderboard-documents-gpqa
Dataset updated
Feb 6, 2025
Authors
Jerome White
Description
jerome-white/leaderboard-documents-gpqa dataset hosted on Hugging Face and contributed by the HF Datasets community
Gpqa by Model
artificialanalysis.ai
Cite
Artificial Analysis, Gpqa by Model [Dataset]. https://artificialanalysis.ai/evaluations/gpqa-diamond
Dataset authored and provided by
Artificial Analysis
Description
Comparison of GPQA results by Model, from evaluations independently conducted by Artificial Analysis
gpqa
huggingface.co
opendatalab.com
Updated Nov 21, 2023
Cite
David Rein (2023). gpqa [Dataset]. https://huggingface.co/datasets/Idavidrein/gpqa
Dataset updated
Nov 21, 2023
Authors
David Rein
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for GPQA

GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts reach only 34% accuracy, despite spending more than 30 minutes per question with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
GPQA Dataset
paperswithcode.com
Updated Apr 13, 2024
Cite
David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman (2024). GPQA Dataset [Dataset]. https://paperswithcode.com/dataset/gpqa
Dataset updated
Apr 13, 2024
Authors
David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman
Description
GPQA stands for Graduate-Level Google-Proof Q&A Benchmark. It is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms.

Description: GPQA consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be high-quality and extremely difficult.
Expert Accuracy: Experts who hold or are pursuing PhDs in the corresponding domains achieve only 65% accuracy on these questions (or 74% when excluding clear mistakes identified in retrospect).
Google-Proof: The questions are "Google-proof": even with unrestricted access to the web, highly skilled non-expert validators reach only 34% accuracy despite spending over 30 minutes searching for answers.
AI Systems Difficulty: At the time of the paper (2023), state-of-the-art AI systems, including the authors' strongest GPT-4-based baseline, achieved only 39% accuracy on this dataset.

The difficulty of GPQA for both skilled non-experts and cutting-edge AI systems makes it an excellent resource for conducting realistic scalable oversight experiments. These experiments aim to explore ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities¹³.

In summary, GPQA serves as a valuable benchmark for assessing the robustness and limitations of language models, especially when faced with complex and nuanced questions. Its difficulty level encourages research into effective oversight methods, bridging the gap between AI and human expertise.
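To put the accuracy figures quoted above on a common scale, the following minimal sketch converts each percentage into an approximate number of correct answers out of 448 and compares it with random guessing. It rests on two assumptions that the description does not state: that each question has four answer options (so chance is 25%), and that every percentage applies to the full 448-question set, which may not hold for the original experiments. The function names are illustrative only.

```python
# Figures quoted in the description above (accuracy in percent).
TOTAL_QUESTIONS = 448
REPORTED_ACCURACY = {
    "domain experts": 65,
    "skilled non-experts with web access": 34,
    "GPT-4 based baseline": 39,
}

# Assumption: four answer options per question, so random guessing scores 25%.
CHANCE_ACCURACY = 25


def approx_correct(accuracy_pct, total=TOTAL_QUESTIONS):
    """Approximate number of correct answers implied by an accuracy percentage."""
    return round(total * accuracy_pct / 100)


def margin_over_chance(accuracy_pct, chance_pct=CHANCE_ACCURACY):
    """Percentage points by which an accuracy exceeds random guessing."""
    return accuracy_pct - chance_pct


for group, pct in REPORTED_ACCURACY.items():
    print(f"{group}: ~{approx_correct(pct)} of {TOTAL_QUESTIONS} correct, "
          f"{margin_over_chance(pct)} points above chance")
```

Under these assumptions, the non-expert validators sit only 9 percentage points above chance, which is the sense in which the questions are described as Google-proof.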

(1) [2311.12022] GPQA: A Graduate-Level Google-Proof Q&A Benchmark - arXiv.org. https://arxiv.org/abs/2311.12022.
(2) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - Klu. https://klu.ai/glossary/gpqa-eval.
(3) GPA Dataset (Spring 2010 through Spring 2020) - Data Science Discovery. https://discovery.cs.illinois.edu/dataset/gpa/.
(4) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - GitHub. https://github.com/idavidrein/gpqa.
(5) Data Sets - OpenIntro. https://www.openintro.org/data/index.php?data=satgpa.
(6) https://doi.org/10.48550/arXiv.2311.12022.
Pricing by Model
artificialanalysis.ai
Updated May 15, 2025
Cite
Artificial Analysis, Pricing by Model [Dataset]. https://artificialanalysis.ai/evaluations/gpqa-diamond
Dataset updated
May 15, 2025
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Cost (USD) to run the evaluation by Model
Tokens used to run the evaluation by Model
artificialanalysis.ai
Updated May 15, 2025
Cite
Artificial Analysis, Tokens used to run the evaluation by Model [Dataset]. https://artificialanalysis.ai/evaluations/gpqa-diamond
Dataset updated
May 15, 2025
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Tokens used to run the evaluation by Model
Seconds to First Answer Token Received by Model
artificialanalysis.ai
Cite
Artificial Analysis, Seconds to First Answer Token Received by Model [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Seconds to First Answer Token Received by Model; accounts for reasoning model 'thinking' time
Output Speed by Model
artificialanalysis.ai
Cite
Artificial Analysis, Output Speed by Model [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Output Tokens per Second by Model; higher is better
Llama 4 Maverick (FP8) Pricing: Input and Output by Provider
artificialanalysis.ai
Cite
Artificial Analysis, Llama 4 Maverick (FP8) Pricing: Input and Output by Provider [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Price (USD per 1M Tokens) by Provider; lower is better
Seconds to Output 500 Tokens, including reasoning model 'thinking' time by...
artificialanalysis.ai
Cite
Artificial Analysis, Seconds to Output 500 Tokens, including reasoning model 'thinking' time by Model [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Seconds to Output 500 Tokens, including reasoning model 'thinking' time, by Model; lower is better
Pricing: Input and Output by Model
artificialanalysis.ai
Cite
Artificial Analysis, Pricing: Input and Output by Model [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Price: USD per 1M Tokens by Model
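As a reading aid for the "USD per 1M Tokens" unit used in the pricing entries, the minimal sketch below shows how such a price translates into the cost of a run when input and output tokens are billed at separate rates. The function name and all token counts and prices are hypothetical placeholders, not figures from Artificial Analysis.

```python
TOKENS_PER_PRICE_UNIT = 1_000_000  # prices are quoted per 1M tokens


def run_cost_usd(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Cost in USD of a run, given token counts and per-1M-token prices."""
    input_cost = input_tokens / TOKENS_PER_PRICE_UNIT * input_price_per_m
    output_cost = output_tokens / TOKENS_PER_PRICE_UNIT * output_price_per_m
    return input_cost + output_cost


# Hypothetical example: 2M input tokens at $0.50 per 1M tokens and
# 500k output tokens at $2.00 per 1M tokens.
example_cost = run_cost_usd(2_000_000, 500_000, 0.50, 2.00)
print(f"Estimated cost: ${example_cost:.2f}")
```

Keeping input and output prices as separate parameters matters for reasoning models, whose 'thinking' tokens are typically billed as output and can dominate the total.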
Llama 4 Maverick Output Speed by Provider
artificialanalysis.ai
Cite
Artificial Analysis, Llama 4 Maverick Output Speed by Provider [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Output Speed: Output Tokens per Second by Provider
Intelligence vs. Output Speed by Model
artificialanalysis.ai
Cite
Artificial Analysis, Intelligence vs. Output Speed by Model [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comprehensive comparison of Artificial Analysis Intelligence Index vs. Output Speed (Output Tokens per Second) by Model
Intelligence vs. Price by Model
artificialanalysis.ai
Cite
Artificial Analysis, Intelligence vs. Price by Model [Dataset]. https://artificialanalysis.ai/
Dataset authored and provided by
Artificial Analysis
Description
Comprehensive comparison of Artificial Analysis Intelligence Index vs. Price (USD per M Tokens, Log Scale, More Expensive to Cheaper) by Model
Intelligence Index by Devstral Endpoint
artificialanalysis.ai
Updated May 22, 2025
Cite
Artificial Analysis (2025). Intelligence Index by Devstral Endpoint [Dataset]. https://artificialanalysis.ai/models/devstral
Dataset updated
May 22, 2025
Dataset authored and provided by
Artificial Analysis
Description
Comparison of Artificial Analysis Intelligence Index by Model. The index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500