Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for GPQA
GPQA is a multiple-choice question-answering dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts reach only 34% accuracy, despite spending more than 30 minutes with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
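As a quick orientation, here is a minimal sketch of loading the Diamond subset with the Hugging Face datasets library. The config name "gpqa_diamond" and the "Question" column follow the dataset card; the dataset is gated, so a logged-in account that has accepted the terms is assumed.

```python
# Minimal sketch: load the gated GPQA Diamond config with `datasets`.
# Assumes `huggingface-cli login` has been run and access was granted.
from datasets import load_dataset

gpqa = load_dataset("Idavidrein/gpqa", "gpqa_diamond")
row = gpqa["train"][0]
print(row["Question"][:200])  # column name as documented on the card
```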
fingertap/GPQA-Diamond dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
nikhilchandak/GPQA-diamond-free dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
nikhilchandak/gpqa-diamond-test2 dataset hosted on Hugging Face and contributed by the HF Datasets community
mkhalifa/gpqa-diamond-physics dataset hosted on Hugging Face and contributed by the HF Datasets community
GPQA Diamond Dataset
This dataset contains filtered JSONL files of human annotations on question specificity, answer uniqueness, and answer matching to the ground truth, across responses from different models, for the GPQA Diamond dataset.
The dataset was annotated by two human graders. It contains 198 (original size) × 2 = 396 rows, as each row appears twice (once per grader). A human grader, given the question, the ground-truth answer, and the model response, has to judge whether the response matches the… See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/gpqa-diamond-annotations.
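Since each question carries two independent judgments, a natural first analysis is raw inter-annotator agreement. The sketch below assumes illustrative field names ("question_id", "match") and a hypothetical local filename; the actual JSONL schema may differ.

```python
# Hedged sketch: measure raw agreement between the two graders.
import json
from collections import defaultdict

by_question = defaultdict(list)
with open("gpqa_diamond_annotations.jsonl") as f:  # hypothetical filename
    for line in f:
        row = json.loads(line)
        by_question[row["question_id"]].append(row["match"])

pairs = [votes for votes in by_question.values() if len(votes) == 2]
agreement = sum(a == b for a, b in pairs) / len(pairs)
print(f"Raw inter-annotator agreement: {agreement:.1%}")
```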
Reasoning PRM Preference Dataset
This dataset contains reasoning traces from multiple sources (GPQA Diamond and MMLU Pro), labeled with preference information based on correctness verification.
Dataset Description
Overview
The dataset consists of reasoning problems and their solutions, where each example has been verified for correctness and labeled with a preference score. It combines data from two main sources:
GPQA Diamond
MMLU Pro
Data Fields… See the full description on the dataset page: https://huggingface.co/datasets/ariaattarml/verified-reasoning-o1-gpqa-mmlu-pro.
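One common way to consume such correctness-labeled traces is as preference pairs. The sketch below pairs a verified-correct trace with an incorrect one per problem; the field names ("problem", "trace", "is_correct") are assumptions for illustration, not this dataset's documented schema.

```python
# Hedged sketch: build DPO-style preference pairs from verified traces.
from collections import defaultdict

def build_pairs(rows: list[dict]) -> list[dict]:
    by_problem = defaultdict(lambda: {"chosen": [], "rejected": []})
    for r in rows:
        key = "chosen" if r["is_correct"] else "rejected"
        by_problem[r["problem"]][key].append(r["trace"])
    return [
        {"prompt": p, "chosen": v["chosen"][0], "rejected": v["rejected"][0]}
        for p, v in by_problem.items()
        if v["chosen"] and v["rejected"]  # need one of each per problem
    ]
```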
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GPQA Diamond with Llama-3.1-70B-Instruct (up to 1K Samples)
This dataset contains 198 graduate-level science questions from the GPQA Diamond benchmark with up to 1000 candidate responses generated by Llama-3.1-70B-Instruct for each problem. Each response has been evaluated for correctness using a mixture of GPT-4o-mini and procedural Python code to robustly parse different answer formats, and scored by multiple reward models (scalar values) and LM judges (boolean verdicts). For more… See the full description on the dataset page: https://huggingface.co/datasets/hazyresearch/GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1.
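With precomputed scores per candidate, best-of-N selection reduces to an argmax. A hedged sketch, assuming column names ("samples", "reward_scores", "is_correct") that should be checked against the card's actual schema:

```python
# Sketch: best-of-N selection over precomputed candidate responses,
# using one scalar reward-model score per sample.
from datasets import load_dataset

ds = load_dataset(
    "hazyresearch/GPQA_Diamond_with_Llama_3.1_70B_Instruct_up_to_1K_Samples_v1",
    split="train",
)
problem = ds[0]
scores = problem["reward_scores"]            # assumed: one scalar per candidate
best = max(range(len(scores)), key=scores.__getitem__)
print(problem["samples"][best][:200])        # highest-scoring response
print("correct?", problem["is_correct"][best])
```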
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
aradhye/gpqa_diamond dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In view of the significant performance improvements recently demonstrated by the Qwen3 series of models, we conducted a comprehensive evaluation of their capabilities across a range of representative benchmarks. Specifically, we evaluated the Qwen3 models on AIME2024, AIME2025, and GPQA Diamond. The prompt format used in these experiments is provided in the response files; additional details regarding the prompt design will be presented at a later time. Each set of inference experiments… See the full description on the dataset page: https://huggingface.co/datasets/Xuerui2312/Qwen3-8B-Rollout64-32k-AIME2024-AIME2025-GPQA.
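Rollout sets like this are commonly scored by majority vote over the samples per question. Below is a sketch of maj@64 scoring; the JSONL layout ("question_id", "extracted_answer", "gold") and the filename are assumptions about the response files, not a documented format.

```python
# Sketch: score a Rollout64-style file by majority vote per question.
import json
from collections import Counter, defaultdict

answers, gold = defaultdict(list), {}
with open("gpqa_rollouts.jsonl") as f:  # hypothetical filename
    for line in f:
        row = json.loads(line)
        answers[row["question_id"]].append(row["extracted_answer"])
        gold[row["question_id"]] = row["gold"]

correct = sum(
    Counter(votes).most_common(1)[0][0] == gold[qid]
    for qid, votes in answers.items()
)
print(f"maj@64 accuracy: {correct / len(answers):.1%}")
```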
Freeform Datasets
This repository contains two carefully curated datasets for evaluating large language models on human-filtered subsets of popular benchmarks that are suitable for evaluation in a freeform (open-ended) format. These datasets were developed as part of our paper, Answer Matching Outperforms Multiple Choice for Language Model Evaluation.
Dataset Structure
The repository contains two splits:
1. gpqa_diamond Split
Source: Filtered subset of… See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/freeform-datasets.
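A minimal loading sketch, assuming the split name "gpqa_diamond" given above (the second split's name is truncated in the description, so it is not guessed here):

```python
# Hedged sketch: load the gpqa_diamond split named on the card.
from datasets import load_dataset

ds = load_dataset("nikhilchandak/freeform-datasets", split="gpqa_diamond")
print(ds[0])
```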
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenR1-Math-220k_decontaminated
Decontaminated version of open-r1/OpenR1-Math-220k - default/train
Decontamination
Removed any questions that have an 8-gram overlap with common benchmarks: AIME 2024, AIME 2025, MATH500, GPQA Diamond, and LiveCodeBench Code Generation Lite. Used GitHub:huggingface/open-r1/scripts/decontaminate.py with all defaults, following https://github.com/huggingface/open-r1#data-decontamination
python scripts/decontaminate.py --dataset… See the full description on the dataset page: https://huggingface.co/datasets/notpaulmartin/OpenR1-Math-220k_decontaminated.
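For reference, the core test the script applies can be re-implemented in a few lines: drop a training question if any 8-gram of its normalized text also occurs in a benchmark question. This mirrors the idea only; the script's exact tokenization and normalization may differ.

```python
# Sketch of the 8-gram contamination test, under simple word tokenization.
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, benchmark_questions: list[str]) -> bool:
    q_grams = ngrams(question)
    return any(q_grams & ngrams(b) for b in benchmark_questions)
```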
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Answer Matching Dataset
This dataset contains a single split for human annotation analysis:
gpqa_diamond_annotations: Combined GPQA Diamond annotations from all annotators (Ameya + Nikhil)
All other evaluation files are available in the "Files and versions" tab, preserving the original directory structure.
Directory Structure and Data Overview
gpqa_diamond_mcq
combined_samples.jsonl
samples_deepseek-r1-0528.jsonl
samples_llama-4-scout.jsonl
… See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/answer-matching.
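These per-model sample files support answer matching of free-form responses against ground truth. As a baseline illustration only (the accompanying paper relies on stronger, LM-based matchers), here is a simple normalized exact-match check:

```python
# Sketch: normalized exact-match baseline for answer matching.
import re

def normalize(ans: str) -> str:
    return re.sub(r"[^a-z0-9]+", " ", ans.lower()).strip()

def exact_match(response_answer: str, gold_answer: str) -> bool:
    return normalize(response_answer) == normalize(gold_answer)
```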
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description:
OpenScience is a multi-domain synthetic dataset designed to improve general-purpose reasoning in large language models (LLMs). The dataset contains multiple-choice question-answer pairs with detailed reasoning traces and spans diverse domains, including STEM, law, economics, and the humanities. OpenScience aims to boost accuracy on advanced benchmarks such as GPQA-Diamond and MMLU-Pro via supervised finetuning or reinforcement learning. This… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenScience.
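For supervised finetuning, each MCQ row with its reasoning trace needs to be rendered into a chat example. A hedged sketch, with placeholder field names ("question", "choices", "reasoning", "answer") that should be checked against the card's actual schema:

```python
# Hedged sketch: render one OpenScience-style MCQ row as a chat SFT example.
def to_chat_example(row: dict) -> list[dict]:
    options = "\n".join(
        f"{chr(65 + i)}. {c}" for i, c in enumerate(row["choices"])
    )
    return [
        {"role": "user", "content": f"{row['question']}\n{options}"},
        {"role": "assistant",
         "content": f"{row['reasoning']}\n\nAnswer: {row['answer']}"},
    ]
```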