fingertap/GPQA-Diamond dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for GPQA
GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts reach only 34% accuracy, despite spending more than 30 minutes with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.
jinulee-v/gpqa-diamond dataset hosted on Hugging Face and contributed by the HF Datasets community
GPQA Diamond Dataset
This dataset contains filtered JSONL files of human annotations for the GPQA Diamond dataset, covering question specificity, answer uniqueness, and whether each model's answer matches the ground truth.
The dataset was annotated by two human graders. It contains 198 (original size) × 2 = 396 rows, as each row is repeated twice (once for each grader). Given the question, the actual answer, and the model response, a human grader has to answer whether the response matches the… See the full description on the dataset page: https://huggingface.co/datasets/nikhilchandak/gpqa-diamond-annotations.
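The row layout described above (198 questions, each graded twice) can be sketched as follows. This is only an illustration on synthetic data: the field names `question_id`, `grader`, and `matches` are assumptions, not the dataset's actual schema, and the agreement measure is one simple way to compare the two graders.

```python
from collections import defaultdict

N_QUESTIONS = 198  # original GPQA Diamond size, per the description
N_GRADERS = 2      # two human graders

# Synthetic rows; field names are hypothetical, not the real schema.
rows = [
    {"question_id": q, "grader": g, "matches": (q + g) % 7 != 0}
    for q in range(N_QUESTIONS)
    for g in range(N_GRADERS)
]


def grader_agreement(rows):
    """Fraction of questions on which all graders gave the same label."""
    labels_by_question = defaultdict(list)
    for row in rows:
        labels_by_question[row["question_id"]].append(row["matches"])
    agreed = sum(1 for labels in labels_by_question.values() if len(set(labels)) == 1)
    return agreed / len(labels_by_question)


total_rows = len(rows)              # one row per (question, grader) pair
agreement = grader_agreement(rows)  # share of questions with identical labels
```

Grouping by question before comparing labels keeps the calculation independent of row order in the JSONL files.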
Comparison of GPQA Diamond (Scientific Reasoning) by Model
Wilson-Lee/tts-embed-dataset-gpqa-diamond dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
nikhilchandak/GPQA-diamond-free dataset hosted on Hugging Face and contributed by the HF Datasets community
Comparison by model: Artificial Analysis Intelligence Index v2.2, which incorporates 8 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, IFBench, AA-LCR
MIT License: https://opensource.org/licenses/MIT
nikhilchandak/gpqa-diamond-test2 dataset hosted on Hugging Face and contributed by the HF Datasets community
mkhalifa/gpqa-diamond-physics dataset hosted on Hugging Face and contributed by the HF Datasets community
Comparison by model: Artificial Analysis Intelligence Index, which incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500
drproduck/r1-qwen7b-gpqa-diamond-n128 dataset hosted on Hugging Face and contributed by the HF Datasets community
Reasoning PRM Preference Dataset
This dataset contains reasoning traces from multiple sources (GPQA Diamond and MMLU Pro), labeled with preference information based on correctness verification.
Dataset Description
Overview
The dataset consists of reasoning problems and their solutions, where each example has been verified for correctness and labeled with a preference score. It combines data from two main sources:
GPQA Diamond
MMLU Pro
Data Fields… See the full description on the dataset page: https://huggingface.co/datasets/ariaattarml/verified-reasoning-o1-gpqa-mmlu-pro.
HLE SFT GPQA Diamond Dataset
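Overview
This is a Supervised Fine-Tuning (SFT) dataset generated from the GPQA (Graduate-level Google-proof Q&A) Diamond dataset by adding Chain of Thought (CoT) reasoning. It provides answers that include a step-by-step reasoning process for advanced questions in specialized scientific fields (physics, chemistry, biology).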
Dataset statistics
Total questions: 198
Successfully generated: 61
Success rate: 30.8%
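As a quick sanity check, the reported success rate follows directly from the two counts above. The variable names are illustrative only.

```python
total_questions = 198  # total GPQA Diamond questions, as reported
successful = 61        # questions with a successfully generated CoT answer

failed = total_questions - successful
success_rate_pct = round(successful / total_questions * 100, 1)

print(f"{successful}/{total_questions} generated ({success_rate_pct}%), {failed} failed")
```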
File formats
This dataset is provided in the following three formats:
1. CSV format (gpqa_diamond_cot_dataset.csv)
General tabular data; can be opened in Excel and other spreadsheet software; easy to load with Pandas.
2. Parquet format… See the full description on the dataset page: https://huggingface.co/datasets/neko-llm/HLE_SFT_GPQA_Diamond.