Llama 3.3 License: https://choosealicense.com/licenses/llama3.3/
SwallowMath
Resources
🐙 GitHub: Explore the project repository, including pipeline code and prompts at rioyokotalab/swallow-code-math. 📑 arXiv: Read our paper for detailed methodology and results at arXiv:2505.02881. 🤗 Sister Dataset: Discover SwallowCode, our companion dataset for code generation.
What is it?
SwallowMath is a high-quality mathematical dataset comprising approximately 2.3 billion tokens derived from the FineMath-4+ dataset through an… See the full description on the dataset page: https://huggingface.co/datasets/tokyotech-llm/swallow-math.
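A minimal loading sketch, assuming the standard Hugging Face datasets API; the "train" split name and the streaming option are assumptions, so check the dataset card for the actual configuration.

from itertools import islice
from datasets import load_dataset

# Sketch: stream SwallowMath instead of downloading the full ~2.3B-token corpus.
# The "train" split name is an assumption; verify it on the dataset card.
swallow_math = load_dataset("tokyotech-llm/swallow-math", split="train", streaming=True)

for example in islice(iter(swallow_math), 3):
    print(example)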
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
μ-MATH (Meta U-MATH) is a meta-evaluation dataset derived from the U-MATH benchmark. It is intended to assess the ability of LLMs to judge free-form mathematical solutions. The dataset includes 1,084 labeled samples generated from 271 U-MATH tasks, covering problems of varying assessment complexity. For fine-grained performance evaluation results, in-depth analyses and detailed discussions on behaviors and biases of LLM judges, check out our paper.
📊 U-MATH benchmark at Huggingface 🔎 μ-MATH… See the full description on the dataset page: https://huggingface.co/datasets/toloka/mu-math.
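Purely for illustration, the sketch below builds a generic judge prompt of the kind μ-MATH is meant to evaluate; the wording and the problem/reference/solution field names are assumptions, not the dataset's actual schema.

# Illustrative only: a generic LLM-judge prompt for grading a free-form math solution.
# The field names and wording are assumptions; consult the mu-math card for the real schema.
JUDGE_TEMPLATE = (
    "You are grading a student's solution to a university-level math problem.\n"
    "Problem: {problem}\n"
    "Reference answer: {reference}\n"
    "Student solution: {solution}\n"
    "Reply with exactly one word: 'correct' or 'incorrect'."
)

def build_judge_prompt(problem: str, reference: str, solution: str) -> str:
    return JUDGE_TEMPLATE.format(problem=problem, reference=reference, solution=solution)

print(build_judge_prompt("Compute the derivative of x^2.", "2x", "d/dx x^2 = 2x"))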
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is the training dataset for the paper: "On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning" (https://arxiv.org/abs/2505.17508).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description
Synthesizer-8B-math-train-data originates from the paper: CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis available on arXiv. You can visit the repo to learn more about the paper.
Citation
If you find our paper helpful, please cite the original paper: @misc{zhang2025cotbasedsynthesizerenhancingllm, title={CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis}, author={Bohan Zhang and Xiaokang… See the full description on the dataset page: https://huggingface.co/datasets/BoHanMint/Synthesizer-8B-math-train-data.
Other License: https://choosealicense.com/licenses/other/
SAND-Math: A Synthetic Dataset of Difficult Problems to Elevate LLM Math Performance
📃 Paper | 🤗 Dataset
SAND-Math (Synthetic Augmented Novel and Difficult Mathematics) is a high-quality, high-difficulty dataset of mathematics problems and solutions. It is generated using a novel pipeline that addresses the critical bottleneck of scarce, high-difficulty training data for mathematical Large Language Models (LLMs).
Key Features
Novel Problem Generation: Problems are… See the full description on the dataset page: https://huggingface.co/datasets/amd/SAND-MATH.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
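As a usage sketch: GSM8K reference solutions conventionally end with the final numeric answer after a "####" marker, so evaluation code typically loads the dataset and splits on that marker. The "main" config name used below is the commonly used one and should be verified against the dataset card.

from datasets import load_dataset

# Sketch: load GSM8K and extract the final numeric answer after the "####" marker.
# The "main" config name is an assumption; check the dataset card.
gsm8k = load_dataset("openai/gsm8k", "main", split="test")

def extract_final_answer(solution: str) -> str:
    # Reference solutions end with a line like "#### 42".
    return solution.split("####")[-1].strip()

sample = gsm8k[0]
print(sample["question"])
print(extract_final_answer(sample["answer"]))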
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One way to steer generations from large language models (LLMs) is to assign a persona: a role that describes how the user expects the LLM to behave (e.g., a helpful assistant, a teacher, a woman). This paper investigates how personas affect diverse aspects of model behavior. We assign 162 personas from 12 categories, spanning variables such as gender, sexual orientation, and occupation, to seven LLMs. We prompt them to answer questions from five datasets covering objective tasks (e.g., questions about math and history) and subjective tasks (e.g., questions about beliefs and values). We also compare persona generations to two baseline settings: a control persona setting with 30 paraphrases of "a helpful assistant" to control for models' prompt sensitivity, and an empty persona setting where no persona is assigned. We find that for all models and datasets, personas show greater variability than the control setting, and that some measures of persona behavior generalize across models.
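To make the three prompting conditions concrete, here is a minimal sketch of persona, control, and empty-persona prompts; the exact system-prompt wording used in the paper is not reproduced here, so this phrasing is an assumption.

# Illustrative sketch of the three prompting conditions described above.
# The system-prompt wording is an assumption, not the paper's exact template.
def build_prompt(question: str, persona: str | None) -> list[dict]:
    messages = []
    if persona is not None:
        messages.append({"role": "system", "content": f"You are {persona}."})
    messages.append({"role": "user", "content": question})
    return messages

question = "What is 17 * 24?"
persona_prompt = build_prompt(question, "a math teacher")        # persona setting
control_prompt = build_prompt(question, "a helpful assistant")   # control paraphrase
empty_prompt = build_prompt(question, None)                      # empty persona setting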
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Measuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset
[💻 Github] [🌐 Homepage] [📊 Leaderboard] [📊 Open Source Leaderboard] [🔍 Visualization] [📖 Paper]
🚀 Data Usage
from datasets import load_dataset
dataset = load_dataset("MathLLMs/MathVision")
print(dataset)
💥 News
[2025.05.16] 💥 We now support the official open-source leaderboard! 🔥🔥🔥 Skywork-R1V2-38B is the best open-source model, scoring 49.7% on MATH-Vision. 🔥🔥🔥… See the full description on the dataset page: https://huggingface.co/datasets/MathLLMs/MathVision.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for ReliableMath
Dataset Description
A mathematical reasoning dataset including both solvable and unsolvable math problems to evaluate LLM reliability on reasoning tasks.
Repository: GitHub | Paper: arXiv | Leaderboard: Leaderboard | Point of Contact: byxue@se.cuhk.edu.hk
The following illustrates (a) how an unreliable LLM may fabricate incorrect or nonsensical content on math problems, and (b) how a reliable LLM can correctly answer solvable problems or… See the full description on the dataset page: https://huggingface.co/datasets/BeyondHsueh/ReliableMath.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compositional GSM_augmented
Compositional GSM_augmented is a math instruction dataset inspired by Not All LLM Reasoners Are Created Equal. It is based on the nvidia/OpenMathInstruct-2 dataset, so it can be used as a training dataset. It was generated with the meta-llama/Meta-Llama-3.1-70B-Instruct model via Hyperbolic AI (thanks for the free credits!). For the full description of the data, refer to the paper.
Each question in compositional GSM consists of two questions… See the full description on the dataset page: https://huggingface.co/datasets/ChuGyouk/CompositionalGSM_augmented.
Safe (ACL 2025 Main)
TL;DR: A Lean 4 theorem-proving dataset, synthesized with Safe, whose theorems are used to validate the correctness of LLM mathematical reasoning steps. This is the official release accompanying our paper Safe (Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification) and its associated dataset FormalStep. Paper | Code | Dataset
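For flavor, here is a toy Lean 4 theorem of the kind a step-verification dataset targets: a small arithmetic claim stated formally and checked mechanically. It is purely illustrative and not taken from FormalStep.

-- Illustrative only; not an actual FormalStep theorem.
-- A reasoning step such as "2 + 3 = 5" can be restated as a Lean 4 theorem
-- and checked by the kernel, which is the kind of validation Safe relies on.
theorem two_add_three : 2 + 3 = 5 := rfl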
Citation
If you find our work useful, please consider citing our paper.… See the full description on the dataset page: https://huggingface.co/datasets/liuchengwu/FormalStep.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
JEEBench(EMNLP 2023)
Repository for the code and dataset for the paper: "Have LLMs Advanced Enough? A Harder Problem Solving Benchmark For Large Language Models" accepted in EMNLP 2023 as a Main conference paper. https://aclanthology.org/2023.emnlp-main.468/
Citation
If you use our dataset in your research, please cite it using the following @inproceedings{arora-etal-2023-llms, title = "Have {LLM}s Advanced Enough? A Challenging Problem Solving Benchmark For Large… See the full description on the dataset page: https://huggingface.co/datasets/daman1209arora/jeebench.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nemotron-MIND: Math Informed syNthetic Dialogues for Pretraining LLMs
Authors: Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro [Paper] [Blog]
Dataset Description
Figure 1: Math Informed syNthetic Dialogue. We (a) manually design prompts of seven conversational styles, (b) provide the prompt along with raw context as input to an LLM to obtain diverse synthetic… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-MIND.
Dataset Description
The "arxiv_small_nougat" dataset is a collection of 108 recent papers sourced from arXiv, focusing on topics related to Large Language Models (LLM) and Transformers. These papers have been meticulously processed and parsed using Meta's Nougat model, which is specifically designed to retain the integrity of complex elements such as tables and mathematical equations.
Data Format
The dataset contains the parsed content of the selected papers, with special… See the full description on the dataset page: https://huggingface.co/datasets/deep-learning-analytics/arxiv_small_nougat.
BSD 3-Clause License: https://choosealicense.com/licenses/bsd-3-clause/
CMM-Math
💻 GitHub Repo 💻 Paper Link 💻 Math-LLM-7B
📥 Download Supplementary Material
Introduction
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal… See the full description on the dataset page: https://huggingface.co/datasets/ecnu-icalk/cmm-math.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for RLPR-Train-Dataset
GitHub | Paper
News:
[2025.06.23] 📃 Our paper detailing the RLPR framework and this dataset is available here.
Dataset Summary
The RLPR-Train-Dataset is a curated collection of 77k high-quality reasoning prompts specifically designed for enhancing Large Language Model (LLM) capabilities in the general domain (non-mathematical). This dataset is derived from the comprehensive collection of prompts from WebInstruct. We… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/RLPR-Train-Dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🦣 MAmmoTH2: Scaling Instructions from the Web
Project Page: https://tiger-ai-lab.github.io/MAmmoTH2/ | Paper: https://arxiv.org/pdf/2405.03548 | Code: https://github.com/TIGER-AI-Lab/MAmmoTH2
WebInstruct (Subset)
This repo contains the partial dataset used in "MAmmoTH2: Scaling Instructions from the Web". This partial data comes mostly from forums such as Stack Exchange. This subset contains very high-quality data to boost LLM performance through instruction tuning.… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/WebInstructSub.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We request that you not publish examples of this dataset online in plain text, to reduce the risk of leakage into foundation model training corpora.
ReaLMistake
ReaLMistake is a benchmark proposed in the paper "Evaluating LLMs at Detecting Errors in LLM Responses" (COLM 2024). ReaLMistake is a benchmark for evaluating binary error detection methods that detect errors in LLM responses. This benchmark includes natural errors made by GPT-4 and Llama 2 70B on three tasks (math word… See the full description on the dataset page: https://huggingface.co/datasets/ryokamoi/realmistake.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
X-SVAMP
🤗 Paper | 📖 arXiv
Dataset Description
X-SVAMP is an evaluation benchmark for multilingual large language models (LLMs), including questions and answers in 5 languages (English, Chinese, Korean, Italian and Spanish). It is intended to evaluate the math reasoning abilities of LLMs. The dataset was translated from the original English SVAMP by GPT-4-turbo. In our paper, we evaluate LLMs in a zero-shot generative setting: prompt the instruction-tuned LLM with… See the full description on the dataset page: https://huggingface.co/datasets/zhihz0535/X-SVAMP_en_zh_ko_it_es.
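As an illustration of the zero-shot generative setting mentioned above, the sketch below formats a question into a plain prompt and scores the reply by taking its last number; the prompt wording and the answer-parsing rule are assumptions, not the paper's exact protocol.

import re

# Sketch of zero-shot generative evaluation; the prompt wording and the
# "take the last number in the output" parsing rule are assumptions.
def build_zero_shot_prompt(question: str) -> str:
    return ("Solve the following math word problem and give the final numeric answer.\n\n"
            + question)

def parse_numeric_answer(generation: str) -> float | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation)
    return float(numbers[-1]) if numbers else None

def is_correct(generation: str, gold: float, tol: float = 1e-6) -> bool:
    predicted = parse_numeric_answer(generation)
    return predicted is not None and abs(predicted - gold) < tol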
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
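To show how a record is typically turned into a training prompt, here is a minimal formatting sketch using the instruction/input/output fields of the Alpaca release; the template wording mirrors the commonly used Alpaca format and should be checked against the official repository before use.

from datasets import load_dataset

# Sketch: format one Alpaca record into a prompt/response string for instruction tuning.
# Template wording follows the commonly used Alpaca format; verify against the official repo.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def format_example(example: dict) -> str:
    if example["input"]:
        prompt = (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            f"the request.\n\n### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response that "
            f"appropriately completes the request.\n\n### Instruction:\n{example['instruction']}\n\n"
            "### Response:\n"
        )
    return prompt + example["output"]

print(format_example(alpaca[0]))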