100+ datasets found
  1. MATH Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 10, 2025
    Cite
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2025). MATH Dataset [Dataset]. https://paperswithcode.com/dataset/math
    Dataset updated
    Jan 10, 2025
    Authors
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt
    Description

    MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

  2. Math benchmark comparison between Gemini Ultra and GPT-4 in 2024

    • statista.com
    Updated Jul 1, 2025
    Cite
    Statista (2025). Math benchmark comparison between Gemini Ultra and GPT-4 in 2024 [Dataset]. https://www.statista.com/statistics/1450845/math-benchmark-comparison-gemini-ultra-gpt-4/
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2023
    Area covered
    Worldwide
    Description

    Gemini Ultra and GPT-4 are the leading generative AI platforms worldwide, and they perform similarly on mathematical benchmarks. Gemini Ultra is slightly ahead, beating GPT-4 on both mathematical benchmarks as well as on the general MMLU benchmark by around * percent. The lead is not significant enough to consider GPT-4 a lackluster model.

  3. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAI (https://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
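
    A minimal sketch of loading the data with the Hugging Face datasets library (the "main" config and the "#### " final-answer delimiter follow the dataset card; a "socratic" variant also exists):

```python
from datasets import load_dataset

# Minimal sketch: load GSM8K from the Hugging Face Hub. The "main" config
# holds the standard problems; a "socratic" variant also exists.
ds = load_dataset("openai/gsm8k", "main", split="test")

example = ds[0]
# Each answer string ends with "#### <final answer>" after the worked steps.
reasoning, final_answer = example["answer"].rsplit("#### ", 1)
print(example["question"])
print(final_answer.strip())
```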

  4. MathBench Dataset

    • paperswithcode.com
    Updated May 25, 2025
    Cite
    Hongwei Liu; Zilong Zheng; Yuxuan Qiao; Haodong Duan; Zhiwei Fei; Fengzhe Zhou; Wenwei Zhang; Songyang Zhang; Dahua Lin; Kai Chen (2025). MathBench Dataset [Dataset]. https://paperswithcode.com/dataset/mathbench
    Dataset updated
    May 25, 2025
    Authors
    Hongwei Liu; Zilong Zheng; Yuxuan Qiao; Haodong Duan; Zhiwei Fei; Fengzhe Zhou; Wenwei Zhang; Songyang Zhang; Dahua Lin; Kai Chen
    Description

    MathBench is an all-in-one math dataset for language model evaluation, with:

    A Sophisticated Five-Stage Difficulty Mechanism: Unlike the usual mathematical datasets, which evaluate only a single difficulty level or mix levels of unclear difficulty, MathBench provides 3,709 questions divided by education stage, ranging from basic arithmetic through primary, middle school, and high school to college level, giving a clear overview of comprehensive difficulty evaluation results.

    Bilingual Gradient Evaluation: Apart from the basic calculation part which is language-independent, MathBench provides questions in both Chinese and English for the four-stage difficulty datasets from primary to college.

    Implementation of the Robust Circular Evaluation (CE) Method: MathBench uses CE as the main evaluation method for questions. Compared to the usual accuracy metric, CE requires the model to answer the same multiple-choice question multiple times, with the order of the options changed each time; the model is considered correct on the question only if all of its answers are correct. CE results reflect a model's capabilities more realistically, providing more valuable evaluation results (a minimal sketch of the protocol follows this list).

    Support for Basic Theory Questions: For every stage, MathBench provides questions that cover the basic theory knowledge points of the corresponding stage, to ascertain whether the model has genuinely mastered the fundamental concepts of each stage or merely memorized the answers.
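
    As an illustrative outline of the CE protocol described above (not MathBench's actual implementation; `answer_fn` is a hypothetical stand-in for the model call):

```python
from typing import Callable, List

def circular_eval(
    answer_fn: Callable[[str, List[str]], int],  # hypothetical model call: returns chosen option index
    question: str,
    options: List[str],
    correct_index: int,
) -> bool:
    """Circular Evaluation (CE): ask the same multiple-choice question once
    per cyclic rotation of the options; the model is credited only if it
    picks the correct option under every ordering."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]    # rotate the option order
        chosen = answer_fn(question, rotated)          # model's pick in this ordering
        if rotated[chosen] != options[correct_index]:  # compare by content, not position
            return False
    return True
```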

  5. Data from: MGSM Dataset

    • paperswithcode.com
    Updated Aug 20, 2023
    Cite
    Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei (2023). MGSM Dataset [Dataset]. https://paperswithcode.com/dataset/mgsm
    Dataset updated
    Aug 20, 2023
    Authors
    Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei
    Description

    Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K were each translated by human annotators into 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

  6. mu-math

    • huggingface.co
    Updated Jan 14, 2025
    Cite
    Toloka (2025). mu-math [Dataset]. https://huggingface.co/datasets/toloka/mu-math
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 14, 2025
    Dataset authored and provided by
    Toloka
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    μ-MATH (Meta U-MATH) is a meta-evaluation dataset derived from the U-MATH benchmark. It is intended to assess the ability of LLMs to judge free-form mathematical solutions. The dataset includes 1,084 labeled samples generated from 271 U-MATH tasks, covering problems of varying assessment complexity. For fine-grained performance evaluation results, in-depth analyses and detailed discussions on behaviors and biases of LLM judges, check out our paper.

    📊 U-MATH benchmark at Huggingface 🔎 μ-MATH… See the full description on the dataset page: https://huggingface.co/datasets/toloka/mu-math.
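
    A rough sketch of the meta-evaluation loop such a dataset enables; `judge_fn` and the field names here are hypothetical placeholders, not the dataset's actual schema:

```python
from typing import Callable, Dict, List

def judge_agreement(judge_fn: Callable[[str, str], bool],
                    samples: List[Dict]) -> float:
    """Fraction of samples where the LLM judge's valid/invalid verdict on a
    free-form solution matches the gold label."""
    hits = sum(judge_fn(s["problem"], s["solution"]) == s["label"]
               for s in samples)  # hypothetical fields: problem, solution, label
    return hits / len(samples)
```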

  7. Performance of DeepSeek-R1 compared to similar models in mathematics...

    • statista.com
    Updated Jan 30, 2025
    Cite
    Statista (2025). Performance of DeepSeek-R1 compared to similar models in mathematics benchmarks 2025 [Dataset]. https://www.statista.com/statistics/1552888/deepseek-performance-of-deepseek-r1-compared-to-similar-models-by-math-benchmark/
    Dataset updated
    Jan 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2025
    Area covered
    China
    Description

    In a performance comparison on mathematics benchmarks in 2025, DeepSeek's AI model DeepSeek-R1 outperformed all other representative models. The DeepSeek models performed best in the mathematics and Chinese-language benchmarks and weakest in coding.

  8. putnam-axiom-dataset

    • kaggle.com
    Updated Jan 22, 2025
    Cite
    ryati (2025). putnam-axiom-dataset [Dataset]. https://www.kaggle.com/datasets/ryati131457/putnam-axiom-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ryati
    Description

    Dataset of complex math problems with questions and answers.

    This is originally from Hugging Face; the data is not mine, I am just uploading it.

    https://huggingface.co/datasets/Putnam-AXIOM/putnam-axiom-dataset

    @article{fronsdal2024putnamaxiom,
      title={Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning},
      author={Kai Fronsdal and Aryan Gulati and Brando Miranda and Eric Chen and Emily Xia and Bruno de Moraes Dumont and Sanmi Koyejo},
      journal={NeurIPS 2024 Workshop on MATH-AI},
      year={2024},
      month={October},
      url={https://openreview.net/pdf?id=YXnwlZe0yf},
      note={Published: 09 Oct 2024, Last Modified: 09 Oct 2024},
      keywords={Benchmarks, Large Language Models, Mathematical Reasoning, Mathematics, Reasoning, Machine Learning},
      abstract={As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming less challenging. These benchmarks, though foundational, no longer offer the complexity necessary to evaluate the cutting edge of artificial reasoning. In this paper, we present the Putnam-AXIOM Original benchmark, a dataset of 236 challenging problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions. To address the potential data contamination of Putnam problems, we create functional variations for 53 problems in Putnam-AXIOM. We see that most models get a significantly lower accuracy on the variations than the original problems. Even so, our results reveal that Claude-3.5 Sonnet, the best-performing model, achieves 15.96% accuracy on the Putnam-AXIOM original but experiences more than a 50% reduction in accuracy on the variations dataset when compared to its performance on corresponding original problems.},
      license={Apache 2.0}
    }
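
    A toy illustration of the "functional variation" idea from the abstract (not the authors' code): keep a problem's structure, resample its constants, and recompute the ground-truth answer, so memorized answers no longer help:

```python
import random

def functional_variation(seed: int):
    """Toy example: same problem template, fresh surface values."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    question = f"Compute the sum of the first {a} positive multiples of {b}."
    answer = b * a * (a + 1) // 2  # b*(1 + 2 + ... + a) = b*a*(a+1)/2
    return question, answer

print(functional_variation(0))  # original-style instance
print(functional_variation(1))  # functional variation of the same template
```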

  9. hendrycks-MATH-benchmark

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    Minqi Jiang (2025). hendrycks-MATH-benchmark [Dataset]. https://huggingface.co/datasets/minqi/hendrycks-MATH-benchmark
    Dataset updated
    Jun 17, 2025
    Authors
    Minqi Jiang
    Description

    The minqi/hendrycks-MATH-benchmark dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  10. Major AI models, by math and computational reasoning

    • statista.com
    Updated Mar 14, 2025
    Cite
    Statista (2025). Major AI models, by math and computational reasoning [Dataset]. https://www.statista.com/statistics/1600812/ai-math-benchmarking-ranking/
    Dataset updated
    Mar 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    Worldwide
    Description

    In 2024, the Artificial Analysis math index ranked AI models based on their mathematical reasoning using benchmarks like AIME 2024 and Math-500. o1, QwQ-32B, and DeepSeek R1 led the rankings, showing the highest proficiency in mathematical problem solving.

  11. GSM8K Dataset

    • paperswithcode.com
    • tensorflow.org
    Updated Dec 31, 2024
    Cite
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
    Dataset updated
    Dec 31, 2024
    Authors
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
    Description

    GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.

  12. MATH-V Dataset

    • paperswithcode.com
    Updated Sep 3, 2024
    Cite
    Ke Wang; Junting Pan; Weikang Shi; Zimu Lu; Mingjie Zhan; Hongsheng Li (2024). MATH-V Dataset [Dataset]. https://paperswithcode.com/dataset/math-v
    Dataset updated
    Sep 3, 2024
    Authors
    Ke Wang; Junting Pan; Weikang Shi; Zimu Lu; Mingjie Zhan; Hongsheng Li
    Description

    The Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.

    Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on Math-Vision, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development.

  13. dart-math-uniform

    • huggingface.co
    Updated Jul 15, 2024
    Cite
    HKUST NLP Group (2024). dart-math-uniform [Dataset]. https://huggingface.co/datasets/hkust-nlp/dart-math-uniform
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 15, 2024
    Dataset authored and provided by
    HKUST NLP Group
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

    📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub 🐦 Thread@X(Twitter) | 🐶 中文博客@知乎 | 📊 Leaderboard@PapersWithCode | 📑 BibTeX

      Datasets: DART-Math
    

    DART-Math datasets are state-of-the-art, data-efficient open-source instruction tuning datasets for mathematical reasoning.

    Figure 1: Left: Average accuracy on 6 mathematical benchmarks. We compare with models… See the full description on the dataset page: https://huggingface.co/datasets/hkust-nlp/dart-math-uniform.

  14. End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging...

    • zenodo.org
    Updated Jan 23, 2025
    Cite
    IJNLC; IJNLC (2025). End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Model Using Integrated Approach [Dataset]. http://doi.org/10.5281/zenodo.14725164
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    IJNLC; IJNLC
    Description

    End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Model Using Integrated Approach

    Authors

    H.M.Shadman Tabib and Jaber Ahmed Deedar, Bangladesh University of Engineering and Technology, Bangladesh

    Abstract

    This work introduces a systematic approach for enhancing large language models (LLMs) to address Bangla AI mathematical challenges. Through the assessment of diverse LLM configurations, fine-tuning with specific datasets, and the implementation of Retrieval-Augmented Generation (RAG), we enhanced the model's reasoning precision in a multilingual setting. Crucial discoveries indicate that customized prompting, dataset augmentation, and iterative reasoning improve the model's efficiency on Olympiad-level mathematical challenges.

    Keywords

    Large Language Models (LLMs), Fine-Tuning, Bangla AI, Mathematical Reasoning, Retrieval-Augmented Generation (RAG), Multilingual Setting, Customized Prompting, Dataset Augmentation, Iterative Reasoning, Olympiad-Level Challenges.

  15. Ranking of LLM tools in solving math problems 2024

    • statista.com
    Updated Jun 25, 2025
    Cite
    Statista (2025). Ranking of LLM tools in solving math problems 2024 [Dataset]. https://www.statista.com/statistics/1458141/leading-math-llm-tools/
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 2024
    Area covered
    Worldwide
    Description

    As of March 2024, OpenAI o1 was the large language model (LLM) tool that had the best benchmark score in solving math problems, with a score of **** percent. Close behind, in second place, was OpenAI o1-mini, followed by GPT-4o.

  16. Arithmetic Word Problem Compendium

    • opendatabay.com
    Updated Jun 18, 2025
    Cite
    Cephalopod Studio (2025). Arithmetic Word Problem Compendium [Dataset]. https://www.opendatabay.com/data/ai-ml/b3f879dd-c434-4df5-bb6d-430513edf930
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Cephalopod Studio
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Machine Learning and AI
    Description

    Arithmetic Word Problem Compendium Dataset (AWPCD)

    Dataset Description

    The dataset is a comprehensive collection of mathematical word problems spanning multiple domains, with rich metadata and natural-language variations. The problems contain 1-5 steps of mathematical operations and are specifically designed to encourage showing work and maintaining appropriate decimal precision throughout calculations.

    None of the problems have been seen before, and all are free from copyright restrictions.

    The available data comprises 100,000 problems. To license the templating system that generated the data (for orders of magnitude more data, or for customizations such as the number of mathematical steps involved or the addition of domains), contact hello@cephalopod.studio for more information.

    Intended Uses & Limitations

    Intended Uses: The data can be used in 4 areas:
    1) Pretraining
    2) Instruction tuning
    3) Finetuning
    4) Benchmarking existing models

    All those areas are in service of:
    - Training mathematical reasoning systems
    - Developing step-by-step problem-solving capabilities
    - Testing arithmetic operations across diverse real-world contexts
    - Evaluating precision in decimal calculations

    Limitations:
    - Currently English-only
    - Limited to specific mathematical operations
    - Template-based generation may introduce structural patterns
    - Focused on arithmetic operations with up to 5 numbers

    Dataset Summary

    The dataset contains 100,000 total problems:

    Problems span multiple domains, including:
    - Agriculture (soil temperature changes, etc.)
    - Athletics (training hours, distances, etc.)
    - Construction (elevation changes, work hours, etc.)
    - Culinary (cooking temperature changes, calories per serving, etc.)
    - Education (GPA changes, etc.)
    - Entertainment (show ratings, stage lighting, etc.)
    - Finance (stock prices, account balances, etc.)

    Data Format

    Each example is provided in JSONL format with the following structure:

```json
{
  "id": "problem_X",
  "question": "Text of the math problem",
  "metadata": {
    "discrete": boolean,
    "domain": string,
    "numbers": number[],
    "object_type": string,
    "solution": number,
    "operators": string[],
    "decimals": number
  },
  "answer": "Text of the step-by-step solution to the problem"
}
```
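
    A minimal sketch of reading records in this format; the file name awpcd.jsonl is an assumption, and the field names follow the structure above:

```python
import json

# Minimal sketch: each line of the JSONL file is one self-contained problem.
# "awpcd.jsonl" is an assumed local file name; fields follow the schema above.
with open("awpcd.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["id"], "-", record["metadata"]["domain"])
        print(record["question"])
        print(record["answer"])
        break  # show only the first record
```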

    Sample Data

    1. Finance (Account Management):
    "Jack sets up 19 bank accounts for clients. First the total rises to be 2 times greater than before. Following that, another 4 accounts were added."
    
    2. Agriculture (Grain Storage):
    "Kevin oversees 14,457 metric tons of grain storage in the new concrete silo. In the beginning, the facility grows to 3 times its current measure of grain. Following that, the overall supply of grain grows by 1,514 tons. Then, Kevin divides the holdings evenly by 1 and continues operations with a single unit."
    
    3. Temperature Monitoring:
    "The ground temperature measures 5.48 degrees Celsius. First, the temperature values are adjusted to be 1/3.74 the value they were before. Next, a sensor calibration adjustment multiplies all readings by 2.07, and later the measurements need to be scaled down by 1/3.94 due to sensor calibration. Later, the measurements need to be scaled down by 1/2.54 due to sensor calibration, and after that the measurements need to be scaled down by 1/2.21 due to sensor calibration. What is the final soil temperature in degrees Celsius? Round your answer and any steps to 2 decimal places."
    

    Answer Examples

    1. Finance (Account Management):
    "Here's how we can solve this problem:
    "19 accounts times 2 equals 38
    "Addition step: 38 + 4 = 42 accounts
    
    "Based on these steps, the answer is 42."
    
    2. Agriculture (Grain Storage):
    "Following these steps will give us the answer:
    "Multiplication operation: 14,457 tons * 3 = 43,371
    "Add 1514 to 43,371 tons: 44,885
    "x) 44,885 x 1 = 44,885 tons
    
    "Thus, we arrive at the answer: 44885.0."
    
    3. Temperature Monitoring:
    "We can find the answer like this:
    "Division: 5.48 degrees ÷ 3.74 = (Note: rounding to 2 decimal places) about 1.47
    "Multiplication operation: 1.47 degrees * 2.07 = (Note: rounding to 2 decimal places) approximately 3.04
    "3.04 degrees ÷ 3.94 (Note: rounding to 2 decimal places) approximately 0.77
    "0.77 degrees ÷ 2.54 (Note: rounding to 2 decimal places) approximately 0.30
    "When 0.30 degrees are divided by 2.21, the result is (Note: rounding to 2 decimal places) about 0.14
    
    "This means the final result is 0.14."
    

    Features

    Each problem includes:
    - Unique problem ID
    - Natural language question text, including arithmetic operations involving decimals and integers, positive and negative values, and requirements for rounding to a specific number of decimal places
    - Detailed metadata, including:
      - Domain classification
      - Object types and units
      - Numerical values used
      - Mathematical operators
      - Solution value
      - Discreteness flag
      - Decimal precision
      - Tailored value ranges

  17. Math Index by o3-pro Endpoint

    • artificialanalysis.ai
    Updated Jun 10, 2025
    Cite
    Artificial Analysis (2025). Math Index by o3-pro Endpoint [Dataset]. https://artificialanalysis.ai/models/o3-pro
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Math Index, which represents the average of the math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500), by model.

  18. Math Index by MiniMax Endpoint

    • artificialanalysis.ai
    Updated Jun 18, 2025
    Cite
    Artificial Analysis (2025). Math Index by MiniMax Endpoint [Dataset]. https://artificialanalysis.ai/models/minimax-m1-40k
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Math Index, which represents the average of the math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500), by model.

  19. Trends in Math Proficiency (2011-2022): Benchmark School vs. Arizona vs....

    • publicschoolreview.com
    Updated Apr 6, 2024
    Cite
    Public School Review (2024). Trends in Math Proficiency (2011-2022): Benchmark School vs. Arizona vs. Benchmark School Inc. (10972) School District [Dataset]. https://www.publicschoolreview.com/benchmark-school-profile
    Dataset updated
    Apr 6, 2024
    Dataset authored and provided by
    Public School Review
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset tracks annual math proficiency from 2011 to 2022 for Benchmark School vs. Arizona and the Benchmark School Inc. (10972) School District.

  20. MiniF2F Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Aug 14, 2024
    Cite
    Kunhao Zheng; Jesse Michael Han; Stanislas Polu (2024). MiniF2F Dataset [Dataset]. https://paperswithcode.com/dataset/minif2f
    Dataset updated
    Aug 14, 2024
    Authors
    Kunhao Zheng; Jesse Michael Han; Stanislas Polu
    Description

    MiniF2F is a dataset of formal Olympiad-level mathematics problem statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO), as well as material from high-school and undergraduate mathematics courses.
