Dataset Card for reasoning
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/reasoning/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/reasoning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
NaturalReasoning is a large-scale dataset for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from the pretraining corpora DCLM and FineMath. The questions have been deduplicated and decontaminated against popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, we extract the reference final answer from the original document in the pretraining corpora when possible. We also provide a model-generated response from… See the full description on the dataset page: https://huggingface.co/datasets/facebook/natural_reasoning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
License: CC-BY-NC-4.0 · Task: Text Generation (Reasoning) · Language: English (en) · Size: 1M < n < 10M · Hosted on Hugging Face
You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset:
- 18.29% of the questions do not have a reference answer.
- 9.71% of the questions have a single-word answer.
- 21.58% of the questions have a short answer.
- 50.42% of the questions have a long-form reference answer.
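A breakdown like the one above can be computed mechanically. The sketch below buckets toy records by answer length; the reference_answer field name and the word-count cutoffs are illustrative assumptions, not the dataset's actual schema or criteria.

```python
def answer_category(reference_answer):
    """Classify a reference answer as missing, single-word, short, or long-form.

    NOTE: the field name and the 10-word cutoff for "short" are assumptions
    for illustration, not the dataset's documented criteria.
    """
    if not reference_answer:
        return "no_reference"
    n_words = len(reference_answer.split())
    if n_words == 1:
        return "single_word"
    if n_words <= 10:
        return "short"
    return "long_form"

# Toy rows mimicking one possible schema; not taken from the real dataset.
toy_rows = [
    {"reference_answer": ""},
    {"reference_answer": "42"},
    {"reference_answer": "The limit is 1 by L'Hopital's rule."},
    {"reference_answer": " ".join(["word"] * 30)},
]
counts = {}
for row in toy_rows:
    cat = answer_category(row["reference_answer"])
    counts[cat] = counts.get(cat, 0) + 1
print(counts)  # one toy row lands in each bucket
```

Running the same classifier over the full split would yield percentages comparable to those reported above.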
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When used to train the Llama3.1-8B-Instruct model, it achieved better average performance across three key benchmarks: MATH, GPQA, and MMLU-Pro.
[Figure: Scaling Curve — https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg]
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
reasoning-0.01 subset
A synthetic dataset of reasoning chains for a wide variety of tasks. We leverage data like this across multiple reasoning experiments and projects. Stay tuned for reasoning models and more data. Thanks to Hive Digital Technologies (https://x.com/HIVEDigitalTech) for their compute support in this project and beyond.
License: unknown (https://choosealicense.com/licenses/unknown/)
reedmayhew/claude-3.7-sonnet-reasoning dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC (https://choosealicense.com/licenses/cc/)
Dataset Description
ReMI was introduced in ReMI: A Dataset for Reasoning with Multiple Images. It contains 13 tasks, namely: EmojiAlgebra, FuncRead, GeomShape, GeomCost, Collisions, Clocks, Schedule, Charts, CodeEdit, Isomorphism, Maps, RefCOCO, and IQ.
Dataset Usage
Data Downloading
All the data examples were divided into two subsets: train and test.
train: contains 2 examples per task (26 in total) to be used as few-shot examples.
test: contains 200 examples… See the full description on the dataset page: https://huggingface.co/datasets/mehrankazemi/ReMI.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
News
[2025/04/22] We split the data and kept only the medical SFT dataset (medical_o1_sft.json). The file medical_o1_sft_mix.json contains a mix of medical and general instruction data.
[2025/02/22] We released the distilled dataset from Deepseek-R1 based on medical verifiable problems. You can use it to initialize your models with the reasoning chain from Deepseek-R1.
[2024/12/25] We open-sourced the medical reasoning dataset for SFT, built on medical verifiable problems and an LLM… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT.
MME-Reasoning 🔥: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Official repository for "MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs". 🌟 For more details, please refer to the project page. [🚀Project Page] [📖 Paper] [🗃️ Github] [🏆 Leaderboard]
💥 News
[2025.05.23] 🔥 We launch MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs. We release the arXiv paper and all data samples… See the full description on the dataset page: https://huggingface.co/datasets/U4R/MME-Reasoning.
Rhushya/synthetic-reasoning-dataset-llama3-1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Menlo/Maze-Reasoning dataset hosted on Hugging Face and contributed by the HF Datasets community
lccccc-1/Multimodal-Visual-Reasoning-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for my-distiset-986461
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-986461/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/python-reasoning-dataset.
License: ODC-By (https://choosealicense.com/licenses/odc-by/)
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
📑 Paper | 🌐 Project Page | 💾 Released Resources | 📦 Repo
This is the resource page of the CodeI/O collection on Hugging Face; your current position is highlighted with a blue block.
Dataset: CodeI/O-PythonEdu-Reasoning (🤗)
Please also check the raw data after our processing if you are interested:… See the full description on the dataset page: https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation of two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption is correctly describing the image (True) or not (False).
Dataset Card for "livebench/reasoning"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored… See the full description on the dataset page: https://huggingface.co/datasets/livebench/reasoning.
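Because every question carries a verifiable ground-truth answer, grading can reduce to a normalized exact-match check. The sketch below illustrates that idea on toy data; the normalization rules here are illustrative assumptions, not LiveBench's actual grading code.

```python
def normalize(text):
    """Lowercase, trim, and collapse whitespace — an assumed normalization."""
    return " ".join(text.strip().lower().split())

def exact_match_score(predictions, references):
    """Fraction of predictions that match the reference after normalization."""
    hits = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Toy predictions and references (hypothetical, not LiveBench data).
preds = ["  Paris ", "blue whale", "7"]
refs = ["paris", "Blue Whale", "8"]
print(exact_match_score(preds, refs))  # 2 of 3 match
```

Objective scoring of this kind is what lets LiveBench grade hard questions without an LLM judge.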
Shuffled mix of:
Large dataset of high-quality web text: https://huggingface.co/datasets/EleutherAI/fineweb-edu-dedup-10b
Medium dataset of QwQ math reasoning: https://huggingface.co/datasets/PrimeIntellect/NuminaMath-QwQ-CoT-5M
Small dataset of DeepSeek-R1 reasoning traces on math, coding, science, and puzzle data: https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
Intended for disentanglement of advanced reasoning models (SAEs, transcoders). Generation code:… See the full description on the dataset page: https://huggingface.co/datasets/EleutherAI/reasoning-mix.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Synthetic reasoning dataset
Original version:
https://huggingface.co/datasets/lighteval/synthetic_reasoning
Translation source code: https://github.com/martinakaduc/ura-llama/tree/main/dataset_scripts/custom_datasets
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
Open-Thoughts-114k
Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer.
Available Subsets
default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models:
ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
zou-lab/MedCaseReasoning dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
PhysReason is accepted by ACL-2025-main
📋 Overview
PhysReason is a comprehensive physics-based reasoning benchmark consisting of 1,200 physics problems spanning multiple domains, with a focus on both knowledge-based (25%) and reasoning-based (75%) questions. This benchmark addresses the critical gap in evaluating large language models' capabilities in physics-based reasoning, which requires… See the full description on the dataset page: https://huggingface.co/datasets/zhibei1204/PhysReason.