ai2-adapt-dev/openmath-2-math dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OpenMathReasoning
OpenMathReasoning is a large-scale math reasoning dataset for training large language models (LLMs). This dataset contains
306K unique mathematical problems sourced from AoPS forums with:
3.2M long chain-of-thought (CoT) solutions
1.7M long tool-integrated reasoning (TIR) solutions
566K samples that select the most promising solution out of many candidates (GenSelect)
Additional 193K problems sourced from AoPS forums (problems only, no solutions)
We used… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMathReasoning.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenR1-Math-220k
Dataset description
OpenR1-Math-220k is a large-scale dataset for mathematical reasoning. It consists of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5. The traces were verified using Math Verify for most samples and Llama-3.3-70B-Instruct as a judge for 12% of the samples, and each problem contains at least one reasoning trace with a correct answer. The dataset consists of two splits:… See the full description on the dataset page: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k.
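As a rough illustration of working with problems that carry several reasoning traces, the sketch below keeps only the traces flagged as correct. The record layout (a `generations` list paired with a `correct` list of booleans) is an assumption made for this example, not the dataset's confirmed schema; check the dataset page for the actual column names.

```python
def keep_correct_traces(record):
    """Return the reasoning traces whose paired correctness flag is True.

    `record["generations"]` and `record["correct"]` are hypothetical field
    names chosen for this sketch; they must have the same length.
    """
    return [
        trace
        for trace, is_correct in zip(record["generations"], record["correct"])
        if is_correct
    ]


# Hypothetical record with two traces, only the second of which is correct.
example = {
    "problem": "What is 2 + 3?",
    "generations": ["2 + 3 = 6", "2 + 3 = 5"],
    "correct": [False, True],
}

verified = keep_correct_traces(example)
print(verified)
```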
https://choosealicense.com/licenses/other/
OpenMath MATH Masked
We release a masked version of the MATH solutions. This data can be used to aid synthetic generation of additional solutions for the MATH dataset, as it is much less likely to lead to inconsistent reasoning than using the original solutions directly. This dataset was used to construct OpenMathInstruct-1: a math instruction tuning dataset with 1.8M problem-solution pairs generated using the permissively licensed Mixtral-8x7B model. For details of how the masked… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMath-MATH-masked.
https://choosealicense.com/licenses/other/
OpenMathInstruct-1
OpenMathInstruct-1 is a math instruction tuning dataset with 1.8M problem-solution pairs generated using the permissively licensed Mixtral-8x7B model. The problems are from the GSM8K and MATH training subsets, and the solutions are synthetically generated by allowing the Mixtral model to use a mix of text reasoning and code blocks executed by a Python interpreter. The dataset is split into train and validation subsets that we used in the ablation experiments. These two subsets… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMathInstruct-1.
https://choosealicense.com/licenses/other/
OpenMath GSM8K Masked
We release a masked version of the GSM8K solutions. This data can be used to aid synthetic generation of additional solutions for the GSM8K dataset, as it is much less likely to lead to inconsistent reasoning than using the original solutions directly. This dataset was used to construct OpenMathInstruct-1: a math instruction tuning dataset with 1.8M problem-solution pairs generated using the permissively licensed Mixtral-8x7B model. For details of how the masked… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMath-GSM8K-masked.
McGill-NLP/openmath-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
SMSHAH/open-math-reasoning dataset hosted on Hugging Face and contributed by the HF Datasets community
SFT Format Dataset
Overview
This dataset has been converted to SFT (Supervised Fine-Tuning) format. It was created by transforming the OpenMathInstruct and Stanford Human Preferences (SHP) datasets.
Dataset Structure
Each entry follows this format:
Instruction: [Problem, question, or conversation history]
Response: [Solution, answer, or response]
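The Instruction/Response layout described above can be sketched as a pair of helper functions. This is only an illustration: the function names are hypothetical, and it assumes the two fields are separated by a single newline, which the dataset card does not confirm.

```python
INSTRUCTION_PREFIX = "Instruction: "
RESPONSE_SEPARATOR = "\nResponse: "


def to_sft_text(instruction: str, response: str) -> str:
    """Join an instruction and a response into one SFT-style string."""
    return f"{INSTRUCTION_PREFIX}{instruction}{RESPONSE_SEPARATOR}{response}"


def parse_sft_text(text: str) -> dict:
    """Split an SFT-style string back into its two fields.

    The split happens at the first "\\nResponse: " marker, so an instruction
    that itself contains that marker would be parsed incorrectly.
    """
    head, separator, response = text.partition(RESPONSE_SEPARATOR)
    if not separator or not head.startswith(INSTRUCTION_PREFIX):
        raise ValueError("text does not follow the Instruction/Response format")
    return {
        "instruction": head[len(INSTRUCTION_PREFIX):],
        "response": response,
    }


entry = to_sft_text("What is 7 * 6?", "7 * 6 = 42")
print(entry)
print(parse_sft_text(entry))
```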
Usage Guide
Loading the Dataset
from datasets import load_dataset
d1shs0ap/openmath-reasoning-hard-extracted-solution dataset hosted on Hugging Face and contributed by the HF Datasets community
ketchup123/tulu-gsm8k-openmath-instruct-100k dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OpenMathInstruct-2
OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model. The training set problems of GSM8K and MATH are used for constructing the dataset in the following ways:
Solution augmentation: Generating chain-of-thought solutions for training set problems in GSM8K and MATH.
Problem-Solution augmentation: Generating new problems, followed by solutions for these new problems.
… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📐 OpenMath-Difficulty-Annotated
🚀 Overview
OpenMath-Difficulty-Annotated is a curated subset of OpenMathInstruct-2 containing 10,176 math problems, enhanced with precise difficulty metadata. While the original solutions are preserved from NVIDIA's dataset, we employed a 120B-parameter model (LLM-as-a-Judge) to analyze and grade every problem on a scale of 1 to 5. This allows developers of Small Language Models (1B-3B) to filter out "Olympiad-level" noise… See the full description on the dataset page: https://huggingface.co/datasets/HAD653/OpenMath-Difficulty-Annotated.
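A minimal sketch of the kind of filtering the 1-to-5 difficulty grade makes possible. The `difficulty` field name and the cutoff of 3 are assumptions chosen for illustration; the dataset page documents the actual column names and what each level means.

```python
def filter_by_difficulty(records, max_level=3):
    """Keep records whose difficulty grade is at most `max_level`.

    Assumes each record is a dict with an integer `difficulty` key in the
    range 1 to 5 (a hypothetical field name for this sketch).
    """
    if not 1 <= max_level <= 5:
        raise ValueError("max_level must be between 1 and 5")
    return [record for record in records if record["difficulty"] <= max_level]


# Hypothetical records; real entries would also carry the original solution.
problems = [
    {"problem": "Compute 12 + 30.", "difficulty": 1},
    {"problem": "Solve x^2 - 5x + 6 = 0.", "difficulty": 2},
    {"problem": "Prove an olympiad-style inequality.", "difficulty": 5},
]

easier = filter_by_difficulty(problems, max_level=3)
print([record["problem"] for record in easier])
```

Lowering `max_level` tightens the filter, which may suit smaller models; the right cutoff depends on the model and is not specified by the source.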
ykarout/nvidia-openmath-phi-format-unbalanced dataset hosted on Hugging Face and contributed by the HF Datasets community
SFT Format Dataset
Overview
This dataset has been converted to SFT (Supervised Fine-Tuning) format. It was created by transforming the OpenMathInstruct and Stanford Human Preferences (SHP) datasets.
Dataset Structure
Each entry follows this format:
Instruction: [Problem, question, or conversation history]
Response: [Solution, answer, or response]
Usage Guide
Loading the Dataset
from datasets import load_dataset
Seono/sft-dedup-openmath-eval dataset hosted on Hugging Face and contributed by the HF Datasets community
mikasenghaas/OpenMath-Nemotron-7B-AIME25 dataset hosted on Hugging Face and contributed by the HF Datasets community
ai2-adapt-dev/openmath-2-gsm8k dataset hosted on Hugging Face and contributed by the HF Datasets community
unsloth/OpenMathReasoning-mini dataset hosted on Hugging Face and contributed by the HF Datasets community