40 datasets found
  1. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAIhttps://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

  2. P

    GSM8K Dataset

    • paperswithcode.com
    • tensorflow.org
    • +2more
    Updated Dec 31, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
    Explore at:
    Dataset updated
    Dec 31, 2024
    Authors
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
    Description

    GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.

  3. h

    GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity

    • huggingface.co
    Updated Jan 12, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shreyan C (2025). GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity [Dataset]. https://huggingface.co/datasets/thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 12, 2025
    Authors
    Shreyan C
    Description

    thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. h

    LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity

    • huggingface.co
    Updated Jan 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shreyan C (2025). LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity [Dataset]. https://huggingface.co/datasets/thethinkmachine/LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 11, 2025
    Authors
    Shreyan C
    Description

    thethinkmachine/LlamaGemma-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. h

    SNU_Ko-GSM8K

    • huggingface.co
    Updated Jun 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    THUNDER Research Group (2025). SNU_Ko-GSM8K [Dataset]. https://huggingface.co/datasets/thunder-research-group/SNU_Ko-GSM8K
    Explore at:
    Dataset updated
    Jun 13, 2025
    Dataset authored and provided by
    THUNDER Research Group
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Note: Evaluation code for each benchmark dataset is under preparation and will be released soon to support standardized model assessment.

      Dataset Card for Ko-GSM8K
    
    
    
    
    
      Dataset Summary
    

    Ko-GSM8K is a Korean adaptation of the GSM8K dataset, which is composed of high-quality grade school-level math word problems. Each problem requires multi-step reasoning involving basic arithmetic operations. The Korean version translates and localizes each item with careful attention to… See the full description on the dataset page: https://huggingface.co/datasets/thunder-research-group/SNU_Ko-GSM8K.

  6. h

    Mr-GSM8K

    • huggingface.co
    Updated Dec 31, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Randolphzeng (2023). Mr-GSM8K [Dataset]. https://huggingface.co/datasets/Randolphzeng/Mr-GSM8K
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2023
    Authors
    Randolphzeng
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    View the project page: https://github.com/dvlab-research/DiagGSM8K see our paper at https://arxiv.org/abs/2312.17080

      Description
    

    In this work, we introduce a novel evaluation paradigm for Large Language Models, one that challenges them to engage in meta-reasoning. Our paradigm shifts the focus from result-oriented assessments, which often overlook the reasoning process, to a more holistic evaluation that effectively differentiates the cognitive capabilities among models. For… See the full description on the dataset page: https://huggingface.co/datasets/Randolphzeng/Mr-GSM8K.

  7. P

    GSM-Plus Dataset

    • paperswithcode.com
    Updated Jun 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Qintong Li; Leyang Cui; Xueliang Zhao; Lingpeng Kong; Wei Bi (2025). GSM-Plus Dataset [Dataset]. https://paperswithcode.com/dataset/gsm-plus
    Explore at:
    Dataset updated
    Jun 8, 2025
    Authors
    Qintong Li; Leyang Cui; Xueliang Zhao; Lingpeng Kong; Wei Bi
    Description

    By perturbing the widely used GSM8K dataset, an adversarial dataset for grade-school math called GSM-Plus is created. Motivated by the capability taxonomy for solving math problems mentioned in Polya's principles, this paper identifies 5 perspectives to guide the development of GSM-Plus:

    Numerical Variation refers to altering the numerical data or its types, including 3 subcategories: Numerical Substitution, Digit Expansion, and Integer-decimal-fraction Conversion. Arithmetic Variation refers to reversing or introducing additional operations (e.g., addition, subtraction, multiplication, and division) to math problems, including 2 subcategories: Adding Operation and Reversing Operation. Problem Understanding refers to rephrasing the text description of the math problems. Distractor Insertion refers to inserting topic-related but useless sentences to the problems. Critical Thinking focuses on question or doubt ability when the question lacks necessary statements.

    GSM-Plus can be used to evaluate the robustness of current LLMs in mathematical reasoning.

  8. Elementary School Math Problems (Hard)

    • kaggle.com
    Updated Jun 11, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chagin (2024). Elementary School Math Problems (Hard) [Dataset]. https://www.kaggle.com/datasets/chayanonc/gsm-hard-elementary-school-math-problems
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Kaggle
    Authors
    Chagin
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Description 📄

    A diverse set of grade-school math problems aiming to evaluate and improve the mathematical reasoning abilities of language models. This dataset is a more challenging variant of the GSM8K. It was modified by substituting the numbers with larger, less frequently encountered values.

    The Data 💬

    • Format: DataFrame with following features

    • Input: (Text) Math question

      Henry made two stops during his 60-mile bike trip. He first stopped after 20 miles. His second stop was 15 miles before the end of the trip. How many miles did he travel between his first and second stops?

    • Target: (Number) The answer to the corresponding question

      25

    Use case 🤖

    • Math reasoning: Train your model to solve various math problems.

    Don't know where to start? 🤔

    Get start with our introduction notebook, where we utilize GPT (Generative-Pretrained-Transformers) and PAL (program-aided language models) to solve math problems step-by-step!!

    HAPPY KAGGLING !!

  9. P

    Data from: MGSM Dataset

    • paperswithcode.com
    Updated Aug 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei (2023). MGSM Dataset [Dataset]. https://paperswithcode.com/dataset/mgsm
    Explore at:
    Dataset updated
    Aug 20, 2023
    Authors
    Freda Shi; Mirac Suzgun; Markus Freitag; Xuezhi Wang; Suraj Srivats; Soroush Vosoughi; Hyung Won Chung; Yi Tay; Sebastian Ruder; Denny Zhou; Dipanjan Das; Jason Wei
    Description

    Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated via human annotators in 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

  10. h

    Llama-3.1-405B-Instruct-evals

    • huggingface.co
    Updated Jul 23, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Meta Llama (2024). Llama-3.1-405B-Instruct-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-Instruct-evals
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Metahttp://meta.com/
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.1/https://choosealicense.com/licenses/llama3.1/

    Description

    Dataset Card for Llama-3.1-405B-Instruct Evaluation Result Details

    This dataset contains the Meta evaluation result details for Llama-3.1-405B-Instruct. The dataset has been created from 30 evaluation tasks. These tasks are human_eval, gorilla_api_bench_huggingface, mmlu_pro, infinite_bench, api_bank, human_eval_plus, ifeval_loose, mmlu_0_shot_cot, nih_multi_needle, multilingual_mmlu_de, mmlu, gsm8k, mgsm, multilingual_mmlu_fr, multilingual_mmlu_pt, math_hard… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.1-405B-Instruct-evals.

  11. h

    gsm-hard

    • huggingface.co
    Updated Apr 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Reasoning Machines (2023). gsm-hard [Dataset]. https://huggingface.co/datasets/reasoning-machines/gsm-hard
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 9, 2023
    Dataset authored and provided by
    Reasoning Machines
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Summary

    This is the harder version of gsm8k math reasoning dataset (https://huggingface.co/datasets/gsm8k). We construct this dataset by replacing the numbers in the questions of GSM8K with larger numbers that are less common.

      Supported Tasks and Leaderboards
    

    This dataset is used to evaluate math reasoning

      Languages
    

    English - Numbers

      Dataset Structure
    

    dataset = load_dataset("reasoning-machines/gsm-hard") DatasetDict({ train: Dataset({… See the full description on the dataset page: https://huggingface.co/datasets/reasoning-machines/gsm-hard.

  12. h

    GSM8K_Difficulty

    • huggingface.co
    Updated Apr 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Language, Intelligence, and Model Evaluation Lab (2025). GSM8K_Difficulty [Dataset]. https://huggingface.co/datasets/lime-nlp/GSM8K_Difficulty
    Explore at:
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Language, Intelligence, and Model Evaluation Lab
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Difficulty Estimation on DeepScaleR

    We annotate the entire GSM8K dataset with a difficulty score based on the performance of the Qwen 2.5-MATH-7B model. This provides an adaptive signal for curriculum construction and model evaluation. GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.… See the full description on the dataset page: https://huggingface.co/datasets/lime-nlp/GSM8K_Difficulty.

  13. h

    Llama-3.2-3B-Instruct-evals

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Meta Llama, Llama-3.2-3B-Instruct-evals [Dataset]. https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-Instruct-evals
    Explore at:
    Dataset provided by
    Metahttp://meta.com/
    Authors
    Meta Llama
    License

    https://choosealicense.com/licenses/llama3.2/https://choosealicense.com/licenses/llama3.2/

    Description

    Dataset Card for Meta Evaluation Result Details for Llama-3.2-3B-Instruct

    This dataset contains the results of the Meta evaluation result details for Llama-3.2-3B-Instruct. The dataset has been created from 21 evaluation tasks. The tasks are: hellaswag_chat, infinite_bench, mmlu_hindi_chat, mmlu_portugese_chat, ifeval_loose, nih_multi_needle, mmlu, gsm8k, mgsm, mmlu_thai_chat, mmlu_spanish_chat, gpqa, bfcl_chat, mmlu_french_chat, ifeval_strict, nexus, math, arc_challenge… See the full description on the dataset page: https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-Instruct-evals.

  14. h

    tinyGSM8k

    • huggingface.co
    Updated Aug 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    tinyBenchmarks (2022). tinyGSM8k [Dataset]. https://huggingface.co/datasets/tinyBenchmarks/tinyGSM8k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    tinyBenchmarks
    Description

    tinyGSM8K

    Welcome to tinyGSM8K! This dataset serves as a concise version of the GSM8K dataset, offering a subset of 100 data points selected from the original compilation. tinyGSM8K is designed to enable users to efficiently estimate the performance of a large language model (LLM) with reduced dataset size, saving computational resources while maintaining the essence of the GSM8K evaluation.

      Features
    

    Compact Dataset: With only 100 data points, tinyGSM8K provides a… See the full description on the dataset page: https://huggingface.co/datasets/tinyBenchmarks/tinyGSM8k.

  15. h

    gsm8k-prolog

    • huggingface.co
    Updated Sep 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xiaocheng Yang (2023). gsm8k-prolog [Dataset]. https://huggingface.co/datasets/Thomas-X-Yang/gsm8k-prolog
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 9, 2023
    Authors
    Xiaocheng Yang
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K-Prolog

      Dataset Summary
    

    This is the Prolog annotated version of the GSM8K math reasoning dataset. We used the same dataset splits and questions in GSM8K and prompted GPT-4 to generate the Prolog programs to solve the questions. We then manually corrected some malfunctioning samples.

      Supported Tasks and Leaderboards
    

    This dataset can be used to train language models to generate Prolog codes in order to solve math questions and evaluate the… See the full description on the dataset page: https://huggingface.co/datasets/Thomas-X-Yang/gsm8k-prolog.

  16. P

    DART-Math-Hard Dataset

    • paperswithcode.com
    Updated Jun 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). DART-Math-Hard Dataset [Dataset]. https://paperswithcode.com/dataset/dart-math-hard
    Explore at:
    Dataset updated
    Jun 17, 2024
    Description

    🎯 DART-Math

    Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

    📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub

    🐦 Thread@X(Twitter) | 🐶 中文博客@知乎 | 📊 Leaderboard@PapersWithCode | 📑 BibTeX

    Datasets: DART-Math DART-Math datasets are the state-of-the-art and data-efficient open-source instruction tuning datasets for mathematical reasoning.

    DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query set from MATH and GSK8K training sets, achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, opposite to vanilla rejection sampling.

    Performance produced by DART-Math-Hard is usually but not necessarily slightly better (~1% absolutely) than DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.

    Comparison between Mathematical Instruction Tuning Datasets Most of previous datasets are constructed with ChatGPT, and many of them are not open-source, especially for ones of the best performance.

    Math SFT Dataset# of SamplesMATHGSM8KCollegeSynthesis Agent(s)Open-Source
    WizardMath96k32.380.423.1GPT-4
    MetaMathQA395k29.876.519.3GPT-3.5
    MMIQC2294k37.475.428.5GPT-4+GPT-3.5+Human
    Orca-Math200k------GPT-4
    Xwin-Math-V1.11440k45.584.927.6GPT-4
    KPMath-Plus1576k46.882.1-–GPT-4
    MathScaleQA2021k35.274.821.8GPT-3.5+Human
    DART-Math-Uniform591k43.582.626.9DeepSeekMath-7B-RL
    DART-Math-Hard585k45.581.129.4DeepSeekMath-7B-RL

    MATH and GSM8K are in-domain, while College(Math) is out-of-domain. Performance here are of models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1 based on Llama2-7B. Bold/Italic means the best/second best score here.

    Dataset Construction: DARS - Difficulty-Aware Rejection Sampling Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.

    Motivated by the observation above, we propose to Difficulty-Aware Rejection Sampling (DARS), to collect more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:

    1) Uniform, which involves sampling responses for each query until each query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset; 2) Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries will receive $k_p$ responses and kp is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works that demonstrate difficult samples can be more effective to enhance model capabilities (Sorscher et al., 2022; Liu et al., 2024b).

    See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.

    Citation If you find our data, model or code useful for your work, please kindly cite our paper:

    latex @article{tong2024dartmath, title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving}, author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He}, year={2024}, eprint={2407.13690}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2407.13690}, }

  17. h

    gsm8k_tr-v0.2

    • huggingface.co
    Updated Apr 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohamad Alhajar (2024). gsm8k_tr-v0.2 [Dataset]. https://huggingface.co/datasets/malhajar/gsm8k_tr-v0.2
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 26, 2024
    Authors
    Mohamad Alhajar
    Description

    This Dataset is part of a series of datasets aimed at advancing Turkish LLM Developments by establishing rigid Turkish benchmarks to evaluate the performance of LLM's Produced in the Turkish Language.

      Dataset Card for gsm8k-tr-v0.2
    

    malhajar/gsm8k-tr-v0.2 is an enhanced version of the original gsm8k-tr, aimed specifically to be used in the OpenLLMTurkishLeaderboard_v0.2. This dataset has been generated using GPT-4 to ensure high linguistic quality and diversity in Turkish language… See the full description on the dataset page: https://huggingface.co/datasets/malhajar/gsm8k_tr-v0.2.

  18. h

    gsm8k_prompted_sandbagging_llama_31_8b_instruct

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    AISI whitebox evaluations, gsm8k_prompted_sandbagging_llama_31_8b_instruct [Dataset]. https://huggingface.co/datasets/aisi-whitebox/gsm8k_prompted_sandbagging_llama_31_8b_instruct
    Explore at:
    Dataset authored and provided by
    AISI whitebox evaluations
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Inspect Dataset: gsm8k_prompted_sandbagging_llama_31_8b_instruct

      Dataset Information
    

    This dataset was created using the create_inspect_dataset function from the deception_sprint package on 2025-05-02.

      Model Information
    

    Model: vllm/meta-llama/Llama-3.1-8B-Instruct

      Task Information
    

    Tasks: inspect_evals/gsm8k

      System Prompts
    

    Prompt 1: Benign

    You are a helpful, harmless and honest language model.

    Prompt 2: Malicious

    You are an AI… See the full description on the dataset page: https://huggingface.co/datasets/aisi-whitebox/gsm8k_prompted_sandbagging_llama_31_8b_instruct.

  19. h

    GSM8k-bgeval

    • huggingface.co
    Updated Feb 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Institute for Computer Science, Artificial intelligence and Technology (2024). GSM8k-bgeval [Dataset]. https://huggingface.co/datasets/INSAIT-Institute/GSM8k-bgeval
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 18, 2024
    Dataset authored and provided by
    Institute for Computer Science, Artificial intelligence and Technology
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    INSAIT-Institute/GSM8k-bgeval dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. h

    gsm8k

    • huggingface.co
    Updated Mar 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    InfiniAI Lab (2025). gsm8k [Dataset]. https://huggingface.co/datasets/InfiniAILab/gsm8k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 9, 2025
    Dataset provided by
    INFINIAI
    Authors
    InfiniAI Lab
    Description

    Copied from OpenAI/gsm8k with answers extracted for evaluations and reinforement learning.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
Organization logo

gsm8k

openai/gsm8k

Grade School Math 8K

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 11, 2022
Dataset authored and provided by
OpenAIhttps://openai.com/
License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Dataset Card for GSM8K

  Dataset Summary

GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

Search
Clear search
Close search
Google apps
Main menu