Dataset Card for GeoQA-train-Vision-R1-cot-rewrite
This dataset provides a rewritten version of the CoT (Chain-of-Thought) annotations for the GeoQA subset of the Vision-R1-cold dataset. It is designed to support efficient and structured multimodal reasoning with large language models.
Dataset Details
Dataset Description
The original Vision-R1 dataset, introduced in the paper Vision-R1: Reflective Multimodal Reasoning with Aha Moments, features detailed and… See the full description on the dataset page: https://huggingface.co/datasets/LoadingBFX/GeoQA-train-Vision-R1-cot-rewrite.
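A minimal loading sketch using the Hugging Face datasets library; the split and field names are not documented on this page, so the snippet discovers them at runtime instead of assuming a particular schema.

```python
from datasets import load_dataset

# Load the rewritten GeoQA CoT annotations from the Hub.
# Split and column names are assumptions we avoid: we inspect
# whatever the repository actually exposes.
ds = load_dataset("LoadingBFX/GeoQA-train-Vision-R1-cot-rewrite")

split = next(iter(ds))  # e.g. "train", if present
print("available splits:", list(ds))
print("fields in one record:", list(ds[split][0].keys()))
```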
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
pt-sk/Vision-COT: a dataset hosted on Hugging Face and contributed by the HF Datasets community.
3D-CoT Benchmark: Chain-of-Thought Datasets for 3D Point Cloud-Language Models
Overview
The 3D-CoT Benchmark is a structured reasoning dataset designed to enable systematic study of how Chain-of-Thought (CoT) reasoning affects 3D vision-language alignment. By extending existing 3D datasets with carefully structured reasoning annotations, this benchmark enables rigorous exploration of multimodal reasoning capabilities, significantly enhancing interpretability and… See the full description on the dataset page: https://huggingface.co/datasets/Battam/3D-CoT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How do newborns learn to see? We propose that visual systems are space-time fitters, meaning visual development can be understood as a blind fitting process (akin to evolution) in which visual systems gradually adapt to the spatiotemporal data distributions in the newborn’s environment. To test whether space-time fitting is a viable theory for learning how to see, we performed parallel controlled-rearing experiments on newborn chicks and deep neural networks (DNNs), including CNNs and transformers. First, we raised newborn chicks in impoverished environments containing a single object, then simulated those environments in a video game engine. Second, we recorded first-person images from agents moving through the virtual animal chambers and used those images to train DNNs. Third, we compared the viewpoint-invariant object recognition performance of the chicks and DNNs. When DNNs received the same visual diet (training data) as the chicks, the models developed the same object recognition skills as the chicks. DNNs that used time as a teaching signal (space-time fitters) also showed the same patterns of successes and failures across the test viewpoints as the chicks. Thus, DNNs can learn object recognition in the same impoverished environments as newborn animals. We argue that space-time fitters can serve as formal scientific models of newborn visual systems, providing image-computable models for studying how newborns learn to see from raw visual experiences.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
🤗Data | 📄Paper | 🚀Code | 💻Model | 🔥Citation
Dataset Details
Dataset type: The LISA++ CoT dataset is a QA dataset designed to train MLLMs for Visual CoT and reasoning segmentation, enhancing the model's global understanding ability. It is based on the COCO 2017 dataset. Where to send questions or comments about the dataset: https://github.com/dvlab-research/LISA. Paper:… See the full description on the dataset page: https://huggingface.co/datasets/Senqiao/LISA_Plus_COT.
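A quick way to peek at a few records without downloading the full dataset; streaming mode and the "train" split name are assumptions about how the repository is laid out.

```python
from itertools import islice

from datasets import load_dataset

# Stream the LISA++ CoT records so nothing is downloaded up front;
# field names are not documented on this page, so we only print keys.
stream = load_dataset(
    "Senqiao/LISA_Plus_COT",
    split="train",   # assumed split name
    streaming=True,
)

for record in islice(stream, 3):
    print(sorted(record.keys()))
```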
WeThink-Multimodal-Reasoning-120K
Image Type
Image data can be accessed from https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k (a download sketch follows the table below).
Image Type | Source Dataset | Images
General Images | COCO | 25,344
General Images | SAM-1B | 18,091
General Images | Visual Genome | 4,441
General Images | GQA | 3,251
General Images | PISC | 835
General Images | LLaVA | 134
Text-Intensive Images | TextVQA | 25,483
Text-Intensive Images | ShareTextVQA | 538
Text-Intensive Images | DocVQA | 4,709
Text-Intensive Images | OCR-VQA | 5,142
Text-Intensive Images | ChartQA | 21,781
Scientific & Technical | GeoQA+ | 4,813
Scientific & Technical | ScienceQA | 4,990
Scientific & Technical | AI2D | 1,812
Scientific & Technical | CLEVR-Math | 677
… See the full description on the dataset page: https://huggingface.co/datasets/WeThink/WeThink-Multimodal-Reasoning-120K.
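A hedged sketch for pairing the WeThink annotations with their images, which (per the note above) live in the LLaVA-CoT-100k repository; the record field that stores the image path, and whether the default datasets loader works for this repo, are assumptions.

```python
from datasets import load_dataset
from huggingface_hub import snapshot_download

# Annotation records (questions, CoT traces, answers) for WeThink.
annotations = load_dataset("WeThink/WeThink-Multimodal-Reasoning-120K")

# Image files are hosted separately in the LLaVA-CoT-100k dataset repo;
# mirror it locally so relative image paths in the records (whatever
# field they use) can be resolved against this directory.
image_root = snapshot_download(
    repo_id="Xkev/LLaVA-CoT-100k",
    repo_type="dataset",
    # allow_patterns=[...]  # optionally restrict this large download
)

split = next(iter(annotations))
print("record fields:", list(annotations[split][0].keys()))
print("images mirrored to:", image_root)
```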
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GameQA is a large-scale, diverse, and challenging multimodal reasoning dataset designed to enhance the general reasoning capabilities of Vision Language Models (VLMs). Generated using the innovative Code2Logic framework, it leverages game code to synthesize high-quality visual-language Chain-of-Thought (CoT) data. The dataset addresses the scarcity of multimodal reasoning data, critical for advancing complex multi-step reasoning in VLMs. Each sample includes visual game… See the full description on the dataset page: https://huggingface.co/datasets/Code2Logic/GameQA-140K.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
ViC-Bench
About ViC-Bench
Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much as a human would, and has shown impressive success on a variety of tasks, driving advances in related benchmarks. Despite this promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly… See the full description on the dataset page: https://huggingface.co/datasets/meituan-longcat/ViC-Bench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SSR-CoT and SSRBench: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Paper: https://arxiv.org/abs/2505.12448 | Project Page: https://yliu-cs.github.io/SSR/ | Code: https://github.com/yliu-cs/SSR
Abstract
Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth… See the full description on the dataset page: https://huggingface.co/datasets/yliu-cs/SSRBench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Athenea-VL: Advanced Multimodal Reasoning Dataset (Aquiles-ai/Athenea-VL)
Dataset Description
Athenea-VL is a comprehensive multimodal reasoning dataset designed for training vision-language models on complex scientific and analytical tasks. This dataset combines high-quality visual content with Chain-of-Thought (CoT) reasoning, making it ideal for developing models capable of step-by-step problem-solving across multiple domains.
Key Features
20,913… See the full description on the dataset page: https://huggingface.co/datasets/Aquiles-ai/Athenea-VL.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
UniVG-R1 Model Card
Model details
We propose UniVG-R1, a reasoning-guided MLLM for universal visual grounding, which leverages reinforcement learning to enhance reasoning across complex multi-image and multi-modal scenarios.
Dataset details
We provide three JSON files as follows:
revised_MIG_bench.json: contains our revised version of the MIG_bench.
stage1_cotsft.json: contains the CoT-SFT data required for stage 1.
stage2_rl.json: which… See the full description on the dataset page: https://huggingface.co/datasets/GD-ML/UniVG-R1-data.
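The three JSON files can be fetched individually with huggingface_hub; a minimal sketch, assuming the files sit at the repository root (the exact in-repo paths are not shown on this page).

```python
import json

from huggingface_hub import hf_hub_download

# Fetch one of the three JSON files listed above; swap the filename
# for revised_MIG_bench.json or stage2_rl.json as needed. If the file
# lives in a subdirectory of the repo, include that path in `filename`.
path = hf_hub_download(
    repo_id="GD-ML/UniVG-R1-data",
    filename="stage1_cotsft.json",   # assumed to be at the repo root
    repo_type="dataset",
)

with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

print(f"loaded {len(records)} top-level entries from {path}")
```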
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is the RL dataset for the paper: "ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL". ReasonGen-R1 is a two-stage framework that imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning (SFT) on a newly generated reasoning dataset of written rationales. It then refines its outputs using Group Relative Policy Optimization (GRPO). This dataset contains the model-crafted rationales paired with visual prompts… See the full description on the dataset page: https://huggingface.co/datasets/Franklin0/ReasonGen-R1-RL-Geneval-12k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🌟 ReVisual-R1 (7B) — Open-Source Multimodal Reasoner
One cold-start, two RL stages, endless reasoning power.
🔑 Highlights
SOTA on 9 tough benchmarks covering visual–math + text reasoning.
Three-Stage SRO Training
Text Cold-Start: seed deep reflection
Multimodal RL: align vision & logic
Text RL: polish fluency & brevity
PAD (Prioritized Advantage Distillation) keeps gradients alive.
Efficient-Length Reward = concise, self-reflective CoT.
📚… See the full description on the dataset page: https://huggingface.co/datasets/csfufu/Grammer_dataset.