100+ datasets found
  1. MATH Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 10, 2025
    Cite
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2025). MATH Dataset [Dataset]. https://paperswithcode.com/dataset/math
    Dataset updated
    Jan 10, 2025
    Authors
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt
    Description

    MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

  2. Math benchmark comparison between Gemini Ultra and GPT-4 in 2024

    • statista.com
    Updated Jul 1, 2025
    Cite
    Statista (2025). Math benchmark comparison between Gemini Ultra and GPT-4 in 2024 [Dataset]. https://www.statista.com/statistics/1450845/math-benchmark-comparison-gemini-ultra-gpt-4/
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2023
    Area covered
    Worldwide
    Description

    Gemini Ultra and GPT-4 are the leading generative AI platforms worldwide, and they perform similarly on mathematical benchmarks. Gemini Ultra is slightly ahead, beating GPT-4 on both mathematical benchmarks as well as on the general MMLU benchmark by around * percent. The lead is not significant enough to consider GPT-4 a lackluster model.

  3. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAI (https://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
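
    A minimal sketch of loading the data with the Hugging Face datasets library (the "main" config and the "#### " final-answer delimiter follow the dataset card; a "socratic" variant also exists):

```python
from datasets import load_dataset

# Minimal sketch: load GSM8K from the Hugging Face Hub. The "main" config
# holds the standard problems; a "socratic" variant also exists.
ds = load_dataset("openai/gsm8k", "main", split="test")

example = ds[0]
# Each answer string ends with "#### <final answer>" after the worked steps.
reasoning, final_answer = example["answer"].rsplit("#### ", 1)
print(example["question"])
print(final_answer.strip())
```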

  4. MathBench Dataset

    • paperswithcode.com
    Updated May 25, 2025
    Cite
    Hongwei Liu; Zilong Zheng; Yuxuan Qiao; Haodong Duan; Zhiwei Fei; Fengzhe Zhou; Wenwei Zhang; Songyang Zhang; Dahua Lin; Kai Chen (2025). MathBench Dataset [Dataset]. https://paperswithcode.com/dataset/mathbench
    Dataset updated
    May 25, 2025
    Authors
    Hongwei Liu; Zilong Zheng; Yuxuan Qiao; Haodong Duan; Zhiwei Fei; Fengzhe Zhou; Wenwei Zhang; Songyang Zhang; Dahua Lin; Kai Chen
    Description

    MathBench is an all-in-one math dataset for language model evaluation, with:

    A Sophisticated Five-Stage Difficulty Mechanism: Unlike the usual mathematical datasets, which evaluate only a single difficulty level or mix levels of unclear difficulty, MathBench provides 3,709 questions divided by education stage, ranging from basic arithmetic through primary, middle school, and high school to college level, giving a clear overview of comprehensive difficulty evaluation results.

    Bilingual Gradient Evaluation: Apart from the basic calculation part which is language-independent, MathBench provides questions in both Chinese and English for the four-stage difficulty datasets from primary to college.

    Implementation of the Robust Circular Evaluation (CE) Method: MathBench uses CE as the main evaluation method for questions. Compared to the usual accuracy metric, CE requires the model to answer the same multiple-choice question multiple times, with the order of the options changed each time; the model is considered correct on the question only if all of its answers are correct. CE results reflect a model's capabilities more realistically, providing more valuable evaluation results (a minimal sketch of the protocol follows this list).

    Support for Basic Theory Questions: For every stage, MathBench provides questions that cover the basic theory knowledge points of the corresponding stage, to ascertain whether the model has genuinely mastered the fundamental concepts of each stage or merely memorized the answers.
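
    As an illustrative outline of the CE protocol described above (not MathBench's actual implementation; `answer_fn` is a hypothetical stand-in for the model call):

```python
from typing import Callable, List

def circular_eval(
    answer_fn: Callable[[str, List[str]], int],  # hypothetical model call: returns chosen option index
    question: str,
    options: List[str],
    correct_index: int,
) -> bool:
    """Circular Evaluation (CE): ask the same multiple-choice question once
    per cyclic rotation of the options; the model is credited only if it
    picks the correct option under every ordering."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]    # rotate the option order
        chosen = answer_fn(question, rotated)          # model's pick in this ordering
        if rotated[chosen] != options[correct_index]:  # compare by content, not position
            return False
    return True
```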

  5. Data from: MGSM Dataset

    • paperswithcode.com
    Updated Aug 20, 2023
    Cite
    Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei (2023). MGSM Dataset [Dataset]. https://paperswithcode.com/dataset/mgsm
    Dataset updated
    Aug 20, 2023
    Authors
    Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei
    Description

    Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K were each translated by human annotators into 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

  6. mu-math

    • huggingface.co
    Updated Jan 14, 2025
    Cite
    Toloka (2025). mu-math [Dataset]. https://huggingface.co/datasets/toloka/mu-math
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 14, 2025
    Dataset authored and provided by
    Toloka
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    μ-MATH (Meta U-MATH) is a meta-evaluation dataset derived from the U-MATH benchmark. It is intended to assess the ability of LLMs to judge free-form mathematical solutions. The dataset includes 1,084 labeled samples generated from 271 U-MATH tasks, covering problems of varying assessment complexity. For fine-grained performance evaluation results, in-depth analyses and detailed discussions on behaviors and biases of LLM judges, check out our paper.

    📊 U-MATH benchmark at Huggingface 🔎 μ-MATH… See the full description on the dataset page: https://huggingface.co/datasets/toloka/mu-math.
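
    A rough sketch of the meta-evaluation loop such a dataset enables; `judge_fn` and the field names here are hypothetical placeholders, not the dataset's actual schema:

```python
from typing import Callable, Dict, List

def judge_agreement(judge_fn: Callable[[str, str], bool],
                    samples: List[Dict]) -> float:
    """Fraction of samples where the LLM judge's valid/invalid verdict on a
    free-form solution matches the gold label."""
    hits = sum(judge_fn(s["problem"], s["solution"]) == s["label"]
               for s in samples)  # hypothetical fields: problem, solution, label
    return hits / len(samples)
```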

  7. Performance of DeepSeek-R1 compared to similar models in mathematics...

    • statista.com
    Updated Jan 30, 2025
    Cite
    Statista (2025). Performance of DeepSeek-R1 compared to similar models in mathematics benchmarks 2025 [Dataset]. https://www.statista.com/statistics/1552888/deepseek-performance-of-deepseek-r1-compared-to-similar-models-by-math-benchmark/
    Dataset updated
    Jan 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2025
    Area covered
    China
    Description

    In a performance comparison on mathematics benchmarks in 2025, DeepSeek's AI model DeepSeek-R1 outperformed all other representative models. The DeepSeek models performed best in the mathematics and Chinese-language benchmarks and weakest in coding.

  8. putnam-axiom-dataset

    • kaggle.com
    Updated Jan 22, 2025
    Cite
    ryati (2025). putnam-axiom-dataset [Dataset]. https://www.kaggle.com/datasets/ryati131457/putnam-axiom-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ryati
    Description

    Dataset of complex math problems with questions and answers.

    This is originally from Hugging Face; the data is not mine, I am just uploading it.

    https://huggingface.co/datasets/Putnam-AXIOM/putnam-axiom-dataset

    @article{fronsdal2024putnamaxiom,
      title={Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning},
      author={Kai Fronsdal and Aryan Gulati and Brando Miranda and Eric Chen and Emily Xia and Bruno de Moraes Dumont and Sanmi Koyejo},
      journal={NeurIPS 2024 Workshop on MATH-AI},
      year={2024},
      month={October},
      url={https://openreview.net/pdf?id=YXnwlZe0yf},
      note={Published: 09 Oct 2024, Last Modified: 09 Oct 2024},
      keywords={Benchmarks, Large Language Models, Mathematical Reasoning, Mathematics, Reasoning, Machine Learning},
      abstract={As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming less challenging. These benchmarks, though foundational, no longer offer the complexity necessary to evaluate the cutting edge of artificial reasoning. In this paper, we present the Putnam-AXIOM Original benchmark, a dataset of 236 challenging problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions. To address the potential data contamination of Putnam problems, we create functional variations for 53 problems in Putnam-AXIOM. We see that most models get a significantly lower accuracy on the variations than the original problems. Even so, our results reveal that Claude-3.5 Sonnet, the best-performing model, achieves 15.96% accuracy on the Putnam-AXIOM original but experiences more than a 50% reduction in accuracy on the variations dataset when compared to its performance on corresponding original problems.},
      license={Apache 2.0}
    }
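
    A toy illustration of the "functional variation" idea from the abstract (not the authors' code): keep a problem's structure, resample its constants, and recompute the ground-truth answer, so memorized answers no longer help:

```python
import random

def functional_variation(seed: int):
    """Toy example: same problem template, fresh surface values."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    question = f"Compute the sum of the first {a} positive multiples of {b}."
    answer = b * a * (a + 1) // 2  # b*(1 + 2 + ... + a) = b*a*(a+1)/2
    return question, answer

print(functional_variation(0))  # original-style instance
print(functional_variation(1))  # functional variation of the same template
```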

  9. hendrycks-MATH-benchmark

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    Minqi Jiang (2025). hendrycks-MATH-benchmark [Dataset]. https://huggingface.co/datasets/minqi/hendrycks-MATH-benchmark
    Dataset updated
    Jun 17, 2025
    Authors
    Minqi Jiang
    Description

    The minqi/hendrycks-MATH-benchmark dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  10. Major AI models, by math and computational reasoning

    • statista.com
    Updated Mar 14, 2025
    Cite
    Statista (2025). Major AI models, by math and computational reasoning [Dataset]. https://www.statista.com/statistics/1600812/ai-math-benchmarking-ranking/
    Dataset updated
    Mar 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    Worldwide
    Description

    In 2024, the Artificial Analysis math index ranked AI models based on their mathematical reasoning using benchmarks like AIME 2024 and Math-500. o1, QwQ-32B, and DeepSeek R1 led the rankings, showing the highest proficiency in mathematical problem solving.

  11. GSM8K Dataset

    • paperswithcode.com
    • tensorflow.org
    Updated Dec 31, 2024
    Cite
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
    Dataset updated
    Dec 31, 2024
    Authors
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
    Description

    GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.

  12. MATH-V Dataset

    • paperswithcode.com
    Updated Sep 3, 2024
    Cite
    Ke Wang; Junting Pan; Weikang Shi; Zimu Lu; Mingjie Zhan; Hongsheng Li (2024). MATH-V Dataset [Dataset]. https://paperswithcode.com/dataset/math-v
    Dataset updated
    Sep 3, 2024
    Authors
    Ke Wang; Junting Pan; Weikang Shi; Zimu Lu; Mingjie Zhan; Hongsheng Li
    Description

    The Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.

    Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on Math-Vision, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development.

  13. dart-math-uniform

    • huggingface.co
    Updated Jul 15, 2024
    Cite
    HKUST NLP Group (2024). dart-math-uniform [Dataset]. https://huggingface.co/datasets/hkust-nlp/dart-math-uniform
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 15, 2024
    Dataset authored and provided by
    HKUST NLP Group
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

    📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub 🐦 Thread@X(Twitter) | 🐶 中文博客@知乎 | 📊 Leaderboard@PapersWithCode | 📑 BibTeX

      Datasets: DART-Math
    

    DART-Math datasets are state-of-the-art, data-efficient open-source instruction tuning datasets for mathematical reasoning.

    Figure 1: Left: Average accuracy on 6 mathematical benchmarks. We compare with models… See the full description on the dataset page: https://huggingface.co/datasets/hkust-nlp/dart-math-uniform.

  14. End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging...

    • zenodo.org
    Updated Jan 23, 2025
    Cite
    IJNLC; IJNLC (2025). End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Model Using Integrated Approach [Dataset]. http://doi.org/10.5281/zenodo.14725164
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    IJNLC; IJNLC
    Description

    End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Model Using Integrated Approach

    Authors

    H.M.Shadman Tabib and Jaber Ahmed Deedar, Bangladesh University of Engineering and Technology, Bangladesh

    Abstract

    This work introduces a systematic approach for enhancing large language models (LLMs) to address Bangla AI mathematical challenges. Through the assessment of diverse LLM configurations, fine-tuning with specific datasets, and the implementation of Retrieval-Augmented Generation (RAG), we enhanced the model's reasoning precision in a multilingual setting. Crucial discoveries indicate that customized prompting, dataset augmentation, and iterative reasoning improve the model's efficiency on Olympiad-level mathematical challenges.

    Keywords

    Large Language Models (LLMs), Fine-Tuning, Bangla AI, Mathematical Reasoning, Retrieval-Augmented Generation (RAG), Multilingual Setting, Customized Prompting, Dataset Augmentation, Iterative Reasoning, Olympiad-Level Challenges.

  15. Ranking of LLM tools in solving math problems 2024

    • statista.com
    Updated Jun 25, 2025
    Cite
    Statista (2025). Ranking of LLM tools in solving math problems 2024 [Dataset]. https://www.statista.com/statistics/1458141/leading-math-llm-tools/
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 2024
    Area covered
    Worldwide
    Description

    As of March 2024, OpenAI o1 was the large language model (LLM) tool that had the best benchmark score in solving math problems, with a score of **** percent. Close behind, in second place, was OpenAI o1-mini, followed by GPT-4o.

  16. Arithmetic Word Problem Compendium

    • opendatabay.com
    Updated Jun 18, 2025
    Cite
    Cephalopod Studio (2025). Arithmetic Word Problem Compendium [Dataset]. https://www.opendatabay.com/data/ai-ml/b3f879dd-c434-4df5-bb6d-430513edf930
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Cephalopod Studio
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Machine Learning and AI
    Description

    Arithmetic Word Problem Compendium Dataset (AWPCD)

    Dataset Description

    The dataset is a comprehensive collection of mathematical word problems spanning multiple domains, with rich metadata and natural-language variations. The problems contain 1-5 steps of mathematical operations and are specifically designed to encourage showing work and maintaining appropriate decimal precision throughout calculations.

    None of the problems have been seen before, and all are free from copyright restrictions.

    The available data comprises 100,000 problems. To license the templating system that generated the data (for orders of magnitude more data, or for customizations such as the number of mathematical steps involved or the addition of domains), contact hello@cephalopod.studio for more information.

    Intended Uses & Limitations

    Intended Uses: The data can be used in 4 areas:
    1) Pretraining
    2) Instruction tuning
    3) Finetuning
    4) Benchmarking existing models

    All those areas are in service of:
    - Training mathematical reasoning systems
    - Developing step-by-step problem-solving capabilities
    - Testing arithmetic operations across diverse real-world contexts
    - Evaluating precision in decimal calculations

    Limitations:
    - Currently English-only
    - Limited to specific mathematical operations
    - Template-based generation may introduce structural patterns
    - Focused on arithmetic operations with up to 5 numbers

    Dataset Summary

    The dataset contains 100,000 total problems:

    Problems span multiple domains, including:
    - Agriculture (soil temperature changes, etc.)
    - Athletics (training hours, distances, etc.)
    - Construction (elevation changes, work hours, etc.)
    - Culinary (cooking temperature changes, calories per serving, etc.)
    - Education (GPA changes, etc.)
    - Entertainment (show ratings, stage lighting, etc.)
    - Finance (stock prices, account balances, etc.)

    Data Format

    Each example is provided in JSONL format with the following structure:

```json
{
  "id": "problem_X",
  "question": "Text of the math problem",
  "metadata": {
    "discrete": boolean,
    "domain": string,
    "numbers": number[],
    "object_type": string,
    "solution": number,
    "operators": string[],
    "decimals": number
  },
  "answer": "Text of the step-by-step solution to the problem"
}
```
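
    A minimal sketch of reading records in this format; the file name awpcd.jsonl is an assumption, and the field names follow the structure above:

```python
import json

# Minimal sketch: each line of the JSONL file is one self-contained problem.
# "awpcd.jsonl" is an assumed local file name; fields follow the schema above.
with open("awpcd.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["id"], "-", record["metadata"]["domain"])
        print(record["question"])
        print(record["answer"])
        break  # show only the first record
```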

    Sample Data

    1. Finance (Account Management):
    "Jack sets up 19 bank accounts for clients. First the total rises to be 2 times greater than before. Following that, another 4 accounts were added."
    
    2. Agriculture (Grain Storage):
    "Kevin oversees 14,457 metric tons of grain storage in the new concrete silo. In the beginning, the facility grows to 3 times its current measure of grain. Following that, the overall supply of grain grows by 1,514 tons. Then, Kevin divides the holdings evenly by 1 and continues operations with a single unit."
    
    3. Temperature Monitoring:
    "The ground temperature measures 5.48 degrees Celsius. First, the temperature values are adjusted to be 1/3.74 the value they were before. Next, a sensor calibration adjustment multiplies all readings by 2.07, and later the measurements need to be scaled down by 1/3.94 due to sensor calibration. Later, the measurements need to be scaled down by 1/2.54 due to sensor calibration, and after that the measurements need to be scaled down by 1/2.21 due to sensor calibration. What is the final soil temperature in degrees Celsius? Round your answer and any steps to 2 decimal places."
    

    Answer Examples

    1. Finance (Account Management):
    "Here's how we can solve this problem:
    "19 accounts times 2 equals 38
    "Addition step: 38 + 4 = 42 accounts
    
    "Based on these steps, the answer is 42."
    
    2. Agriculture (Grain Storage):
    "Following these steps will give us the answer:
    "Multiplication operation: 14,457 tons * 3 = 43,371
    "Add 1514 to 43,371 tons: 44,885
    "x) 44,885 x 1 = 44,885 tons
    
    "Thus, we arrive at the answer: 44885.0."
    
    3. Temperature Monitoring:
    "We can find the answer like this:
    "Division: 5.48 degrees ÷ 3.74 = (Note: rounding to 2 decimal places) about 1.47
    "Multiplication operation: 1.47 degrees * 2.07 = (Note: rounding to 2 decimal places) approximately 3.04
    "3.04 degrees ÷ 3.94 (Note: rounding to 2 decimal places) approximately 0.77
    "0.77 degrees ÷ 2.54 (Note: rounding to 2 decimal places) approximately 0.30
    "When 0.30 degrees are divided by 2.21, the result is (Note: rounding to 2 decimal places) about 0.14
    
    "This means the final result is 0.14."
    

    Features

    Each problem includes:
    - Unique problem ID
    - Natural language question text, including arithmetic operations involving decimals and integers, positive and negative values, and requirements for rounding to a specific number of decimal places
    - Detailed metadata, including:
      - Domain classification
      - Object types and units
      - Numerical values used
      - Mathematical operators
      - Solution value
      - Discreteness flag
      - Decimal precision
      - Tailored value ranges

  17. Math Index by o3-pro Endpoint

    • artificialanalysis.ai
    Updated Jun 10, 2025
    Cite
    Artificial Analysis (2025). Math Index by o3-pro Endpoint [Dataset]. https://artificialanalysis.ai/models/o3-pro
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Math Index, which represents the average of the math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500), by model.

  18. Math Index by MiniMax Endpoint

    • artificialanalysis.ai
    Updated Jun 18, 2025
    Cite
    Artificial Analysis (2025). Math Index by MiniMax Endpoint [Dataset]. https://artificialanalysis.ai/models/minimax-m1-40k
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Math Index, which represents the average of the math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500), by model.

  19. Trends in Math Proficiency (2011-2022): Benchmark School vs. Arizona vs....

    • publicschoolreview.com
    Updated Apr 6, 2024
    Cite
    Public School Review (2024). Trends in Math Proficiency (2011-2022): Benchmark School vs. Arizona vs. Benchmark School Inc. (10972) School District [Dataset]. https://www.publicschoolreview.com/benchmark-school-profile
    Dataset updated
    Apr 6, 2024
    Dataset authored and provided by
    Public School Review
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset tracks annual math proficiency from 2011 to 2022 for Benchmark School vs. Arizona and the Benchmark School Inc. (10972) School District.

  20. MiniF2F Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Aug 14, 2024
    Cite
    Kunhao Zheng; Jesse Michael Han; Stanislas Polu (2024). MiniF2F Dataset [Dataset]. https://paperswithcode.com/dataset/minif2f
    Dataset updated
    Aug 14, 2024
    Authors
    Kunhao Zheng; Jesse Michael Han; Stanislas Polu
    Description

    MiniF2F is a dataset of formal Olympiad-level mathematics problem statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO), as well as material from high-school and undergraduate mathematics courses.
