51 datasets found
  1. gpqa

    • huggingface.co
    • opendatalab.com
    Updated Nov 21, 2023
    + more versions
    Cite
    David Rein (2023). gpqa [Dataset]. https://huggingface.co/datasets/Idavidrein/gpqa
    Explore at:
    Dataset updated
    Nov 21, 2023
    Authors
    David Rein
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for GPQA

    GPQA is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
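Evaluation harnesses typically turn each record into a four-option prompt by shuffling the distractors into lettered positions. A minimal sketch, assuming the record exposes fields named `Question`, `Correct Answer`, and `Incorrect Answer 1`–`3` (verify the exact field names against the dataset card):

```python
import random

def format_mc_prompt(record, rng=None):
    """Build a 4-option multiple-choice prompt from one GPQA-style record.

    Field names below are assumptions based on the dataset card; check
    them before use. Returns the prompt text and the correct letter.
    """
    rng = rng or random.Random(0)
    choices = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng.shuffle(choices)  # avoid a fixed position for the correct answer
    letters = "ABCD"
    lines = [record["Question"]] + [
        f"({letters[i]}) {c}" for i, c in enumerate(choices)
    ]
    answer = letters[choices.index(record["Correct Answer"])]
    return "\n".join(lines), answer
```

Seeding the shuffle per record keeps evaluations reproducible while still varying the answer position across questions.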

  2. GPQA Dataset

    • paperswithcode.com
    Updated Apr 13, 2024
    Cite
    David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman (2024). GPQA Dataset [Dataset]. https://paperswithcode.com/dataset/gpqa
    Explore at:
    Dataset updated
    Apr 13, 2024
    Authors
    David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman
    Description

    GPQA stands for Graduate-Level Google-Proof Q&A Benchmark. It is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms.

    Description: GPQA consists of 448 multiple-choice questions meticulously crafted by domain experts in biology, physics, and chemistry. These questions are intentionally designed to be high-quality and extremely difficult.

    Expert Accuracy: Even experts who hold or are pursuing PhDs in the corresponding domains achieve only 65% accuracy on these questions (74% when excluding clear mistakes identified in retrospect).

    Google-Proof: The questions are "Google-proof": even with unrestricted access to the web, highly skilled non-expert validators reach only 34% accuracy despite spending over 30 minutes searching for answers.

    AI Systems Difficulty: State-of-the-art AI systems, including the strongest GPT-4-based baseline, achieve only 39% accuracy on this dataset.

    The difficulty of GPQA for both skilled non-experts and cutting-edge AI systems makes it an excellent resource for conducting realistic scalable oversight experiments. These experiments aim to explore ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities [1][3].

    In summary, GPQA serves as a valuable benchmark for assessing the robustness and limitations of language models, especially when faced with complex and nuanced questions. Its difficulty level encourages research into effective oversight methods, bridging the gap between AI and human expertise.

    References:
    (1) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - arXiv.org. https://arxiv.org/abs/2311.12022
    (2) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - Klu. https://klu.ai/glossary/gpqa-eval
    (3) GPA Dataset (Spring 2010 through Spring 2020) - Data Science Discovery. https://discovery.cs.illinois.edu/dataset/gpa/
    (4) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - GitHub. https://github.com/idavidrein/gpqa
    (5) Data Sets - OpenIntro. https://www.openintro.org/data/index.php?data=satgpa
    (6) https://doi.org/10.48550/arXiv.2311.12022
    (7) https://arxiv.org/abs/2311.12022

  3. Gpqa by Model

    • artificialanalysis.ai
    Cite
    Artificial Analysis, Gpqa by Model [Dataset]. https://artificialanalysis.ai/evaluations/gpqa-diamond
    Explore at:
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of GPQA Diamond evaluation results by model, independently conducted by Artificial Analysis.

  4. GPQA-Diamond

    • huggingface.co
    Updated May 28, 2025
    + more versions
    Cite
    Han (2025). GPQA-Diamond [Dataset]. https://huggingface.co/datasets/fingertap/GPQA-Diamond
    Explore at:
    Dataset updated
    May 28, 2025
    Authors
    Han
    Description

    fingertap/GPQA-Diamond dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. gpqa-diamond-annotations

    • huggingface.co
    Cite
    Nikhil Chandak, gpqa-diamond-annotations [Dataset]. https://huggingface.co/datasets/nikhilchandak/gpqa-diamond-annotations
    Explore at:
    Authors
    Nikhil Chandak
    Description

    GPQA Diamond Dataset

    This dataset contains filtered JSONL files of human annotations on question specificity, answer uniqueness, and answer matching against the ground truth for different models on the GPQA Diamond dataset.

    The dataset was annotated by two human graders. It contains 198 (original size) * 2 = 396 rows, as each row is repeated twice (once per grader). Given the question, the actual answer, and a model response, a human grader has to answer whether the response matches the… See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/gpqa-diamond-annotations.
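With two verdicts per question, a natural first statistic is raw inter-annotator agreement over the paired rows. A sketch under assumed field names (`question_id` and `match` are hypothetical placeholders, not the actual JSONL schema):

```python
from collections import defaultdict

def raw_agreement(rows):
    """Fraction of questions where both graders gave the same verdict.

    `rows` is an iterable of dicts; the keys `question_id` and `match`
    are placeholder names -- check the JSONL files for the real schema.
    """
    by_q = defaultdict(list)
    for r in rows:
        by_q[r["question_id"]].append(r["match"])
    # keep only questions with exactly two verdicts (one per grader)
    pairs = [v for v in by_q.values() if len(v) == 2]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)
```

For a publication-grade analysis one would report a chance-corrected statistic such as Cohen's kappa rather than raw agreement.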

  6. GPQA-diamond-free

    • huggingface.co
    Updated Jun 26, 2025
    Cite
    Nikhil Chandak (2025). GPQA-diamond-free [Dataset]. https://huggingface.co/datasets/nikhilchandak/GPQA-diamond-free
    Explore at:
    Dataset updated
    Jun 26, 2025
    Authors
    Nikhil Chandak
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    nikhilchandak/GPQA-diamond-free dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Intelligence Index by GPT-4o Endpoint

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by GPT-4o Endpoint [Dataset]. https://artificialanalysis.ai/models/gpt-4o
    Explore at:
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of models on the Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  8. gpqa-diamond-test2

    • huggingface.co
    Updated Jun 12, 2025
    Cite
    Nikhil Chandak (2025). gpqa-diamond-test2 [Dataset]. https://huggingface.co/datasets/nikhilchandak/gpqa-diamond-test2
    Explore at:
    Dataset updated
    Jun 12, 2025
    Authors
    Nikhil Chandak
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    nikhilchandak/gpqa-diamond-test2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Intelligence Index by Models Model

    • artificialanalysis.ai
    Updated Jun 25, 2025
    + more versions
    Cite
    Artificial Analysis (2025). Intelligence Index by Models Model [Dataset]. https://artificialanalysis.ai/models/open-source
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of models on the Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  10. gpqa-diamond-physics

    • huggingface.co
    Updated Mar 15, 2025
    Cite
    Muhammad Khalifa (2025). gpqa-diamond-physics [Dataset]. https://huggingface.co/datasets/mkhalifa/gpqa-diamond-physics
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 15, 2025
    Authors
    Muhammad Khalifa
    Description

    mkhalifa/gpqa-diamond-physics dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. verified-reasoning-o1-gpqa-mmlu-pro

    • huggingface.co
    Updated Dec 15, 2024
    Cite
    Aria A. (2024). verified-reasoning-o1-gpqa-mmlu-pro [Dataset]. https://huggingface.co/datasets/ariaattarml/verified-reasoning-o1-gpqa-mmlu-pro
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 15, 2024
    Authors
    Aria A.
    Description

    Reasoning PRM Preference Dataset

    This dataset contains reasoning traces from multiple sources (GPQA Diamond and MMLU Pro), labeled with preference information based on correctness verification.

      Dataset Description

      Overview
    The dataset consists of reasoning problems and their solutions, where each example has been verified for correctness and labeled with a preference score. It combines data from two main sources:

    GPQA Diamond
    MMLU Pro

      Data Fields… See the full description on the dataset page: https://huggingface.co/datasets/ariaattarml/verified-reasoning-o1-gpqa-mmlu-pro.
    
  12. Intelligence Index by Grok-1 Endpoint

    • artificialanalysis.ai
    Updated May 15, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by Grok-1 Endpoint [Dataset]. https://artificialanalysis.ai/models/grok-1
    Explore at:
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of models on the Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  13. GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    HazyResearch (2025). GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1 [Dataset]. https://huggingface.co/datasets/hazyresearch/GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1
    Explore at:
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    HazyResearch
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GPQA Diamond with Llama-3.1-70B-Instruct (up to 1K Samples)

    This dataset contains 198 graduate-level science questions from the GPQA Diamond benchmark with up to 1000 candidate responses generated by Llama-3.1-70B-Instruct for each problem. Each response has been evaluated for correctness using a mixture of GPT-4o-mini and procedural Python code to robustly parse different answer formats, and scored by multiple reward models (scalar values) and LM judges (boolean verdicts). For more… See the full description on the dataset page: https://huggingface.co/datasets/hazyresearch/GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1.
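The procedural side of such answer parsing often comes down to a few regular expressions over the model's final text. A rough stand-in for the idea (not the HazyResearch pipeline; the patterns below are illustrative and far less robust than a mixed LLM-plus-code verifier):

```python
import re

def extract_choice(text):
    """Pull a final A-D letter choice from free-form model output.

    Handles a couple of common formats: "Answer: C", "the answer is (C)",
    and LaTeX-style "\\boxed{C}". Returns None when no pattern matches.
    """
    patterns = [
        r"\\boxed\{\(?([A-D])\)?\}",                  # \boxed{C} or \boxed{(C)}
        r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?\b",   # Answer: C / answer is (C)
    ]
    for pat in patterns:
        hits = re.findall(pat, text)
        if hits:
            return hits[-1]  # prefer the last stated answer
    return None
```

Preferring the last match matters in practice, since chain-of-thought outputs often mention several candidate letters before committing to one.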

  14. Qwen3-8B-Rollout64-32k-AIME2024-AIME2025-GPQA

    • huggingface.co
    + more versions
    Cite
    Sherry Smith, Qwen3-8B-Rollout64-32k-AIME2024-AIME2025-GPQA [Dataset]. https://huggingface.co/datasets/Xuerui2312/Qwen3-8B-Rollout64-32k-AIME2024-AIME2025-GPQA
    Explore at:
    Authors
    Sherry Smith
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    In view of the significant performance improvements recently demonstrated by the Qwen3 series of models, we conducted a comprehensive evaluation of their capabilities across a range of representative benchmarks. Specifically, we evaluated the Qwen3 models on AIME2024, AIME2025, and GPQA Diamond. The prompt format used in these experiments is provided in the response files; additional details regarding the prompt design will be presented at a later time. Each set of inference experiments… See the full description on the dataset page: https://huggingface.co/datasets/Xuerui2312/Qwen3-8B-Rollout64-32k-AIME2024-AIME2025-GPQA.

  15. freeform-datasets

    • huggingface.co
    Updated Jul 3, 2025
    Cite
    Nikhil Chandak (2025). freeform-datasets [Dataset]. https://huggingface.co/datasets/nikhilchandak/freeform-datasets
    Explore at:
    Dataset updated
    Jul 3, 2025
    Authors
    Nikhil Chandak
    Description

    Freeform Datasets

    This repository contains two carefully curated datasets for evaluating large language models on human-filtered subsets of popular benchmarks that are suitable for evaluation in freeform (open-ended) format. These datasets were developed as part of our paper, Answer Matching Outperforms Multiple Choice for Language Model Evaluation.

      Dataset Structure

    The repository contains two splits:

      1. gpqa_diamond Split

    Source: Filtered subset of… See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/freeform-datasets.

  16. Intelligence Index by PALM-2 Endpoint

    • artificialanalysis.ai
    Updated May 6, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by PALM-2 Endpoint [Dataset]. https://artificialanalysis.ai/models/palm-2
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of models on the Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.

  17. answer-matching

    • huggingface.co
    Updated Jul 3, 2025
    Cite
    Nikhil Chandak (2025). answer-matching [Dataset]. https://huggingface.co/datasets/nikhilchandak/answer-matching
    Explore at:
    Dataset updated
    Jul 3, 2025
    Authors
    Nikhil Chandak
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Answer Matching Dataset

    This dataset contains a single split for human annotation analysis:

    gpqa_diamond_annotations: Combined GPQA Diamond annotations from all annotators (Ameya + Nikhil)

    All other evaluation files are available in the "Files and versions" tab, preserving the original directory structure.

      Directory Structure and Data Overview

      gpqa_diamond_mcq

    combined_samples.jsonl
    samples_deepseek-r1-0528.jsonl
    samples_llama-4-scout.jsonl
    … See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/answer-matching.

  18. OpenR1-Math-220k_decontaminated

    • huggingface.co
    Updated Feb 12, 2025
    Cite
    Paul Martin (2025). OpenR1-Math-220k_decontaminated [Dataset]. https://huggingface.co/datasets/notpaulmartin/OpenR1-Math-220k_decontaminated
    Explore at:
    Dataset updated
    Feb 12, 2025
    Authors
    Paul Martin
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OpenR1-Math-220k_decontaminated

    Decontaminated version of open-r1/OpenR1-Math-220k - default/train

      Decontamination

    Removed any questions that have an 8-gram overlap with common benchmarks: AIME 2024, AIME 2025, MATH500, GPQA Diamond, and LiveCodeBench Code Generation Lite. Used huggingface/open-r1's scripts/decontaminate.py with all defaults, following https://github.com/huggingface/open-r1#data-decontamination:

    python scripts/decontaminate.py
    --dataset… See the full description on the dataset page: https://huggingface.co/datasets/notpaulmartin/OpenR1-Math-220k_decontaminated.
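The 8-gram rule itself can be sketched in a few lines. This is a simplified illustration of the idea, not the actual decontaminate.py logic, which also normalizes text and streams over whole datasets:

```python
def ngrams(text, n=8):
    """Set of word-level n-grams from whitespace-split, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample, benchmark_questions, n=8):
    """True if `sample` shares any word n-gram with a benchmark question.

    Simplified sketch: no punctuation stripping or other normalization,
    so it is stricter than a production decontamination pass.
    """
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(q, n) for q in benchmark_questions)
```

Texts shorter than n words produce an empty n-gram set and therefore can never be flagged, which is why real pipelines tune n against typical question length.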

  19. gpqa_diamond

    • huggingface.co
    Updated May 21, 2025
    + more versions
    Cite
    Aradhye Agarwal (2025). gpqa_diamond [Dataset]. https://huggingface.co/datasets/aradhye/gpqa_diamond
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Aradhye Agarwal
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    aradhye/gpqa_diamond dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. OpenScience

    • huggingface.co
    Updated Jun 28, 2025
    Cite
    NVIDIA (2025). OpenScience [Dataset]. https://huggingface.co/datasets/nvidia/OpenScience
    Explore at:
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    Nvidiahttp://nvidia.com/
    Authors
    NVIDIA
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description:

    OpenScience is a multi-domain synthetic dataset designed to improve general-purpose reasoning in large language models (LLMs). The dataset contains multiple-choice question-answer pairs with detailed reasoning traces and spans across diverse scientific domains, including STEM, law, economics, and humanities. OpenScience aims to boost accuracy on advanced benchmarks such as GPQA-Diamond and MMLU-Pro via supervised finetuning or reinforcement learning. This… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenScience.
