40 datasets found

gsm8k
huggingface.co
Updated Aug 11, 2022
Cite
OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 11, 2022
Dataset authored and provided by
OpenAI (https://openai.com/)
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for GSM8K

Dataset Summary

GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
GSM8K Dataset
paperswithcode.com
tensorflow.org
+2 more
Updated Dec 31, 2024
Cite
Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
Explore at:
Dataset updated
Dec 31, 2024
Authors
Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
Description
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.
GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity
huggingface.co
Updated Jan 12, 2025
Cite
Shreyan C (2025). GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity [Dataset]. https://huggingface.co/datasets/thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity
Explore at:
Croissant
Dataset updated
Jan 12, 2025
Authors
Shreyan C
Description
thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community
LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity
huggingface.co
Updated Jan 11, 2025
Cite
Shreyan C (2025). LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity [Dataset]. https://huggingface.co/datasets/thethinkmachine/LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity
Explore at:
Croissant
Dataset updated
Jan 11, 2025
Authors
Shreyan C
Description
thethinkmachine/LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community
SNU_Ko-GSM8K
huggingface.co
Updated Jun 13, 2025
Cite
THUNDER Research Group (2025). SNU_Ko-GSM8K [Dataset]. https://huggingface.co/datasets/thunder-research-group/SNU_Ko-GSM8K
Explore at:
Dataset updated
Jun 13, 2025
Dataset authored and provided by
THUNDER Research Group
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Note: Evaluation code for each benchmark dataset is under preparation and will be released soon to support standardized model assessment.

Dataset Card for Ko-GSM8K

Dataset Summary

Ko-GSM8K is a Korean adaptation of the GSM8K dataset, which is composed of high-quality grade school-level math word problems. Each problem requires multi-step reasoning involving basic arithmetic operations. The Korean version translates and localizes each item with careful attention to… See the full description on the dataset page: https://huggingface.co/datasets/thunder-research-group/SNU_Ko-GSM8K.
Mr-GSM8K
huggingface.co
Updated Dec 31, 2023
Cite
Randolphzeng (2023). Mr-GSM8K [Dataset]. https://huggingface.co/datasets/Randolphzeng/Mr-GSM8K
Explore at:
Croissant
Dataset updated
Dec 31, 2023
Authors
Randolphzeng
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
View the project page at https://github.com/dvlab-research/DiagGSM8K and see our paper at https://arxiv.org/abs/2312.17080.

Description

In this work, we introduce a novel evaluation paradigm for Large Language Models, one that challenges them to engage in meta-reasoning. Our paradigm shifts the focus from result-oriented assessments, which often overlook the reasoning process, to a more holistic evaluation that effectively differentiates the cognitive capabilities among models. For… See the full description on the dataset page: https://huggingface.co/datasets/Randolphzeng/Mr-GSM8K.
GSM-Plus Dataset
paperswithcode.com
Updated Jun 8, 2025
Cite
Qintong Li; Leyang Cui; Xueliang Zhao; Lingpeng Kong; Wei Bi (2025). GSM-Plus Dataset [Dataset]. https://paperswithcode.com/dataset/gsm-plus
Explore at:
Dataset updated
Jun 8, 2025
Authors
Qintong Li; Leyang Cui; Xueliang Zhao; Lingpeng Kong; Wei Bi
Description
By perturbing the widely used GSM8K dataset, an adversarial dataset for grade-school math called GSM-Plus is created. Motivated by the capability taxonomy for solving math problems mentioned in Polya's principles, this paper identifies 5 perspectives to guide the development of GSM-Plus:

Numerical Variation refers to altering the numerical data or its types, including 3 subcategories: Numerical Substitution, Digit Expansion, and Integer-decimal-fraction Conversion.
Arithmetic Variation refers to reversing or introducing additional operations (e.g., addition, subtraction, multiplication, and division) to math problems, including 2 subcategories: Adding Operation and Reversing Operation.
Problem Understanding refers to rephrasing the text description of the math problems.
Distractor Insertion refers to inserting topic-related but useless sentences into the problems.
Critical Thinking focuses on the ability to question or doubt when the question lacks necessary statements.

GSM-Plus can be used to evaluate the robustness of current LLMs in mathematical reasoning.
Elementary School Math Problems (Hard)
kaggle.com
Updated Jun 11, 2024
Cite
Chagin (2024). Elementary School Math Problems (Hard) [Dataset]. https://www.kaggle.com/datasets/chayanonc/gsm-hard-elementary-school-math-problems
Explore at:
Croissant
Dataset updated
Jun 11, 2024
Dataset provided by
Kaggle
Authors
Chagin
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Description 📄

A diverse set of grade-school math problems aiming to evaluate and improve the mathematical reasoning abilities of language models. This dataset is a more challenging variant of the GSM8K. It was modified by substituting the numbers with larger, less frequently encountered values.

The Data 💬

Format: DataFrame with following features

Input: (Text) Math question

Henry made two stops during his 60-mile bike trip. He first stopped after 20 miles. His second stop was 15 miles before the end of the trip. How many miles did he travel between his first and second stops?

Target: (Number) The answer to the corresponding question

25
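The sample question and target above can be checked with a short worked calculation. This is only an illustrative sketch; the variable names are hypothetical and not part of the dataset.

```python
# Worked arithmetic for the sample question (variable names are illustrative).
trip_miles = 60                         # total length of the trip
first_stop = 20                         # first stop, measured from the start
second_stop = trip_miles - 15           # second stop is 15 miles before the end
between_stops = second_stop - first_stop

print(between_stops)  # 25
```

The result matches the Target value of 25 listed for this question.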

Use case 🤖

Math reasoning: Train your model to solve various math problems.

Don't know where to start? 🤔

Get started with our introduction notebook, where we use GPT (Generative Pre-trained Transformer) and PAL (program-aided language models) to solve math problems step by step!

HAPPY KAGGLING !!
Data from: MGSM Dataset
paperswithcode.com
Updated Aug 20, 2023
Cite
Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei (2023). MGSM Dataset [Dataset]. https://paperswithcode.com/dataset/mgsm
Explore at:
Dataset updated
Aug 20, 2023
Authors
Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei
Description
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
Llama-3.1-405B-Instruct-evals
huggingface.co
Updated Jul 23, 2024
Cite
Meta Llama (2024). Llama-3.1-405B-Instruct-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-Instruct-evals
Explore at:
Dataset updated
Jul 23, 2024
Dataset provided by
Meta (http://meta.com/)
Authors
Meta Llama
License
https://choosealicense.com/licenses/llama3.1/
Description
Dataset Card for Llama-3.1-405B-Instruct Evaluation Result Details

This dataset contains the Meta evaluation result details for Llama-3.1-405B-Instruct. The dataset has been created from 30 evaluation tasks. These tasks are human_eval, gorilla_api_bench_huggingface, mmlu_pro, infinite_bench, api_bank, human_eval_plus, ifeval_loose, mmlu_0_shot_cot, nih_multi_needle, multilingual_mmlu_de, mmlu, gsm8k, mgsm, multilingual_mmlu_fr, multilingual_mmlu_pt, math_hard… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-Instruct-evals.
gsm-hard
huggingface.co
Updated Apr 9, 2023
Cite
Reasoning Machines (2023). gsm-hard [Dataset]. https://huggingface.co/datasets/reasoning-machines/gsm-hard
Explore at:
Croissant
Dataset updated
Apr 9, 2023
Dataset authored and provided by
Reasoning Machines
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Summary

This is the harder version of gsm8k math reasoning dataset (https://huggingface.co/datasets/gsm8k). We construct this dataset by replacing the numbers in the questions of GSM8K with larger numbers that are less common.
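The number-replacement idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' construction pipeline: the helper names `substitute_numbers` and `solve`, the value range, and the sample question are all hypothetical, and the description here does not say how the new answers were recomputed. The sketch recomputes the answer with a hand-written solution function.

```python
import random
import re

def substitute_numbers(question, rng, low=100_000, high=9_999_999):
    """Replace each integer in the question with a larger random integer (assumed range)."""
    values = []

    def swap(match):
        values.append(rng.randint(low, high))
        return str(values[-1])

    return re.sub(r"\d+", swap, question), values

def solve(trip, first_stop, before_end):
    """Hand-written solution for this one question: distance between the two stops."""
    return trip - before_end - first_stop

question = (
    "Henry made two stops during his 60-mile bike trip. He first stopped after 20 miles. "
    "His second stop was 15 miles before the end of the trip. "
    "How many miles did he travel between his first and second stops?"
)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
hard_question, (trip, first_stop, before_end) = substitute_numbers(question, rng)
hard_answer = solve(trip, first_stop, before_end)
print(hard_question)
print(hard_answer)
```

Blind substitution can produce inconsistent problems (for example, a stop beyond the end of the trip, giving a negative answer), so a real pipeline would need to validate or constrain the sampled values.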

Supported Tasks and Leaderboards

This dataset is used to evaluate math reasoning

Languages

English - Numbers

Dataset Structure

dataset = load_dataset("reasoning-machines/gsm-hard")
DatasetDict({ train: Dataset({… See the full description on the dataset page: https://huggingface.co/datasets/reasoning-machines/gsm-hard.
GSM8K_Difficulty
huggingface.co
Updated Apr 9, 2025
Cite
Language, Intelligence, and Model Evaluation Lab (2025). GSM8K_Difficulty [Dataset]. https://huggingface.co/datasets/lime-nlp/GSM8K_Difficulty
Explore at:
Dataset updated
Apr 9, 2025
Dataset authored and provided by
Language, Intelligence, and Model Evaluation Lab
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Difficulty Estimation on DeepScaleR

We annotate the entire GSM8K dataset with a difficulty score based on the performance of the Qwen 2.5-MATH-7B model. This provides an adaptive signal for curriculum construction and model evaluation. GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.… See the full description on the dataset page: https://huggingface.co/datasets/lime-nlp/GSM8K_Difficulty.
Llama-3.2-3B-Instruct-evals
huggingface.co
Cite
Meta Llama, Llama-3.2-3B-Instruct-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-Instruct-evals
Explore at:
Dataset provided by
Meta (http://meta.com/)
Authors
Meta Llama
License
https://choosealicense.com/licenses/llama3.2/
Description
Dataset Card for Meta Evaluation Result Details for Llama-3.2-3B-Instruct

This dataset contains the results of the Meta evaluation result details for Llama-3.2-3B-Instruct. The dataset has been created from 21 evaluation tasks. The tasks are: hellaswag_chat, infinite_bench, mmlu_hindi_chat, mmlu_portugese_chat, ifeval_loose, nih_multi_needle, mmlu, gsm8k, mgsm, mmlu_thai_chat, mmlu_spanish_chat, gpqa, bfcl_chat, mmlu_french_chat, ifeval_strict, nexus, math, arc_challenge… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-Instruct-evals.
tinyGSM8k
huggingface.co
Updated Aug 11, 2022
Cite
tinyBenchmarks (2022). tinyGSM8k [Dataset]. https://huggingface.co/datasets/tinyBenchmarks/tinyGSM8k
Explore at:
Croissant
Dataset updated
Aug 11, 2022
Dataset authored and provided by
tinyBenchmarks
Description
tinyGSM8K

Welcome to tinyGSM8K! This dataset serves as a concise version of the GSM8K dataset, offering a subset of 100 data points selected from the original compilation. tinyGSM8K is designed to enable users to efficiently estimate the performance of a large language model (LLM) with reduced dataset size, saving computational resources while maintaining the essence of the GSM8K evaluation.

Features

Compact Dataset: With only 100 data points, tinyGSM8K provides a… See the full description on the dataset page: https://huggingface.co/datasets/tinyBenchmarks/tinyGSM8k.
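To give a feel for what evaluating on 100 examples implies, here is a minimal sketch of a naive accuracy estimate with a binomial standard error. This is an assumption for illustration only: it is a plain sample mean, the dataset card may describe a different estimator, and the function name and the 80/20 outcome split are hypothetical.

```python
import math

def accuracy_with_stderr(correct_flags):
    """Plain sample accuracy and its binomial standard error."""
    n = len(correct_flags)
    accuracy = sum(correct_flags) / n
    stderr = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, stderr

# Hypothetical outcome: the model answers 80 of the 100 examples correctly.
flags = [True] * 80 + [False] * 20
accuracy, stderr = accuracy_with_stderr(flags)
print(f"accuracy = {accuracy:.2f} +/- {stderr:.2f}")
```

With only 100 data points the standard error of a plain mean is a few percentage points, which is the trade-off for the reduced compute.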
gsm8k-prolog
huggingface.co
Updated Sep 9, 2023
Cite
Xiaocheng Yang (2023). gsm8k-prolog [Dataset]. https://huggingface.co/datasets/Thomas-X-Yang/gsm8k-prolog
Explore at:
Croissant
Dataset updated
Sep 9, 2023
Authors
Xiaocheng Yang
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for GSM8K-Prolog

Dataset Summary

This is the Prolog annotated version of the GSM8K math reasoning dataset. We used the same dataset splits and questions in GSM8K and prompted GPT-4 to generate the Prolog programs to solve the questions. We then manually corrected some malfunctioning samples.

Supported Tasks and Leaderboards

This dataset can be used to train language models to generate Prolog codes in order to solve math questions and evaluate the… See the full description on the dataset page: https://huggingface.co/datasets/Thomas-X-Yang/gsm8k-prolog.

DART-Math-Hard Dataset

paperswithcode.com

Updated Jun 17, 2024

Cite

(2024). DART-Math-Hard Dataset [Dataset]. https://paperswithcode.com/dataset/dart-math-hard

Explore at:

Dataset updated

Jun 17, 2024

Description

🎯 DART-Math

Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Datasets: DART-Math

DART-Math datasets are the state-of-the-art and data-efficient open-source instruction tuning datasets for mathematical reasoning.

DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query set from the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, the opposite of vanilla rejection sampling.

The performance produced by DART-Math-Hard is usually, but not necessarily, slightly better (~1% absolute) than that of DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.

Comparison between Mathematical Instruction Tuning Datasets

Most previous datasets are constructed with ChatGPT, and many of them are not open-source, especially the best-performing ones.

Math SFT Dataset	# of Samples	MATH	GSM8K	College	Synthesis Agent(s)	Open-Source
WizardMath	96k	32.3	80.4	23.1	GPT-4	✗
MetaMathQA	395k	29.8	76.5	19.3	GPT-3.5	✓
MMIQC	2294k	37.4	75.4	28.5	GPT-4+GPT-3.5+Human	✓
Orca-Math	200k	--	--	--	GPT-4	✓
Xwin-Math-V1.1	1440k	45.5	84.9	27.6	GPT-4	✗
KPMath-Plus	1576k	46.8	82.1	--	GPT-4	✗
MathScaleQA	2021k	35.2	74.8	21.8	GPT-3.5+Human	✗
DART-Math-Uniform	591k	43.5	82.6	26.9	DeepSeekMath-7B-RL	✓
DART-Math-Hard	585k	45.5	81.1	29.4	DeepSeekMath-7B-RL	✓

Note: MATH and GSM8K are in-domain, while College(Math) is out-of-domain. Scores are for models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic marks the best/second-best score.

Dataset Construction: DARS - Difficulty-Aware Rejection Sampling

Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.

Motivated by the observation above, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:

1) Uniform, which involves sampling responses for each query until each query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2) Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries will receive $k_p$ responses, where $k_p$ is a hyperparameter.

Prop2Diff introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works that demonstrate difficult samples can be more effective to enhance model capabilities (Sorscher et al., 2022; Liu et al., 2024b).
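The two strategies can be sketched with a small simulation. This is a hedged illustration, not the authors' implementation: model responses are simulated as random draws with a per-query pass rate, the difficulty score is assumed to be the fail rate (1 minus the pass rate), and the names `collect_correct`, `dars_uniform`, `dars_prop2diff`, and `max_trials` are hypothetical.

```python
import random

def collect_correct(pass_rate, target, rng, max_trials=10_000):
    """Simulate sampling responses until `target` correct ones are collected."""
    correct = 0
    for _ in range(max_trials):
        if correct >= target:
            break
        if rng.random() < pass_rate:  # simulated check of one sampled response
            correct += 1
    return correct

def dars_uniform(pass_rates, k_u, rng):
    """Uniform: every query accumulates k_u correct responses."""
    return {q: collect_correct(p, k_u, rng) for q, p in pass_rates.items()}

def dars_prop2diff(pass_rates, k_p, rng):
    """Prop2Diff: target count proportional to difficulty; hardest query gets k_p."""
    difficulty = {q: 1.0 - p for q, p in pass_rates.items()}  # assumed: fail rate
    hardest = max(difficulty.values()) or 1.0
    targets = {q: max(1, round(k_p * d / hardest)) for q, d in difficulty.items()}
    return {q: collect_correct(pass_rates[q], targets[q], rng) for q in pass_rates}

rng = random.Random(0)
pass_rates = {"easy": 0.9, "medium": 0.5, "hard": 0.1}  # hypothetical queries
uniform_counts = dars_uniform(pass_rates, k_u=4, rng=rng)
prop2diff_counts = dars_prop2diff(pass_rates, k_p=9, rng=rng)
print(uniform_counts)
print(prop2diff_counts)
```

In this toy setup, Uniform ends with the same number of correct responses for every query, while Prop2Diff gives the hardest query the most, which is the bias towards difficult queries described above.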

See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.

Citation

If you find our data, model or code useful for your work, please kindly cite our paper:

@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}

gsm8k_tr-v0.2
huggingface.co
Updated Apr 26, 2024
Cite
Mohamad Alhajar (2024). gsm8k_tr-v0.2 [Dataset]. https://huggingface.co/datasets/malhajar/gsm8k_tr-v0.2
Explore at:
Croissant
Dataset updated
Apr 26, 2024
Authors
Mohamad Alhajar
Description
This dataset is part of a series of datasets aimed at advancing Turkish LLM development by establishing rigid Turkish benchmarks to evaluate the performance of LLMs produced in the Turkish language.

Dataset Card for gsm8k-tr-v0.2

malhajar/gsm8k-tr-v0.2 is an enhanced version of the original gsm8k-tr, aimed specifically to be used in the OpenLLMTurkishLeaderboard_v0.2. This dataset has been generated using GPT-4 to ensure high linguistic quality and diversity in Turkish language… See the full description on the dataset page: https://huggingface.co/datasets/malhajar/gsm8k_tr-v0.2.
gsm8k_prompted_sandbagging_llama_31_8b_instruct
huggingface.co
Cite
AISI whitebox evaluations, gsm8k_prompted_sandbagging_llama_31_8b_instruct [Dataset]. https://huggingface.co/datasets/aisi-whitebox/gsm8k_prompted_sandbagging_llama_31_8b_instruct
Explore at:
Dataset authored and provided by
AISI whitebox evaluations
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Inspect Dataset: gsm8k_prompted_sandbagging_llama_31_8b_instruct

Dataset Information

This dataset was created using the create_inspect_dataset function from the deception_sprint package on 2025-05-02.

Model Information

Model: vllm/meta-llama/Llama-3.1-8B-Instruct

Task Information

Tasks: inspect_evals/gsm8k

System Prompts

Prompt 1: Benign

You are a helpful, harmless and honest language model.

Prompt 2: Malicious

You are an AI… See the full description on the dataset page: https://huggingface.co/datasets/aisi-whitebox/gsm8k_prompted_sandbagging_llama_31_8b_instruct.
GSM8k-bgeval
huggingface.co
Updated Feb 18, 2024
Cite
Institute for Computer Science, Artificial intelligence and Technology (2024). GSM8k-bgeval [Dataset]. https://huggingface.co/datasets/INSAIT-Institute/GSM8k-bgeval
Explore at:
Croissant
Dataset updated
Feb 18, 2024
Dataset authored and provided by
Institute for Computer Science, Artificial intelligence and Technology
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
INSAIT-Institute/GSM8k-bgeval dataset hosted on Hugging Face and contributed by the HF Datasets community
gsm8k
huggingface.co
Updated Mar 9, 2025
Cite
InfiniAI Lab (2025). gsm8k [Dataset]. https://huggingface.co/datasets/InfiniAILab/gsm8k
Explore at:
Croissant
Dataset updated
Mar 9, 2025
Dataset provided by
INFINIAI
Authors
InfiniAI Lab
Description
Copied from OpenAI/gsm8k with answers extracted for evaluations and reinforcement learning.
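Answer extraction of this kind can be sketched as below. This is an illustration, not this dataset's actual preprocessing: it assumes the original GSM8K convention of ending each solution with a `#### <answer>` line, and the function name `extract_final_answer` and the example solution text are hypothetical.

```python
import re

def extract_final_answer(solution):
    """Return the number after the '####' marker as a string, or None if absent."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators

# Hypothetical solution text written in the GSM8K style.
example = (
    "The second stop is at 60 - 15 = 45 miles.\n"
    "He rode 45 - 20 = 25 miles between the stops.\n"
    "#### 25"
)
print(extract_final_answer(example))  # 25
```

Keeping the answer as a normalized string avoids floating-point surprises when comparing a model's output against the reference.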
