Dataset Card for GeoQA-train-Vision-R1-cot-rewrite
This dataset provides a rewritten version of the CoT (Chain-of-Thought) annotations for the GeoQA subset of the Vision-R1-cold dataset. It is designed to support efficient and structured multimodal reasoning with large language models.
Dataset Details
Dataset Description
The original Vision-R1 dataset, introduced in the paper Vision-R1: Reflective Multimodal Reasoning with Aha Moments, features detailed and… See the full description on the dataset page: https://huggingface.co/datasets/LoadingBFX/GeoQA-train-Vision-R1-cot-rewrite.
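A minimal loading sketch using the Hugging Face datasets library; the split and field names are not documented on this page, so the snippet discovers them at runtime instead of assuming a particular schema.

```python
from datasets import load_dataset

# Load the rewritten GeoQA CoT annotations from the Hub.
# Split and column names are assumptions we avoid: we inspect
# whatever the repository actually exposes.
ds = load_dataset("LoadingBFX/GeoQA-train-Vision-R1-cot-rewrite")

split = next(iter(ds))  # e.g. "train", if present
print("available splits:", list(ds))
print("fields in one record:", list(ds[split][0].keys()))
```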
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
pt-sk/Vision-COT: a dataset hosted on Hugging Face and contributed by the HF Datasets community.
3D-CoT Benchmark: Chain-of-Thought Datasets for 3D Point Cloud-Language Models
Overview
The 3D-CoT Benchmark is a structured reasoning dataset designed to enable systematic study of how Chain-of-Thought (CoT) reasoning affects 3D vision-language alignment. By extending existing 3D datasets with carefully structured reasoning annotations, this benchmark enables rigorous exploration of multimodal reasoning capabilities, significantly enhancing interpretability and… See the full description on the dataset page: https://huggingface.co/datasets/Battam/3D-CoT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How do newborns learn to see? We propose that visual systems are space-time fitters, meaning visual development can be understood as a blind fitting process (akin to evolution) in which visual systems gradually adapt to the spatiotemporal data distributions in the newborn’s environment. To test whether space-time fitting is a viable theory for learning how to see, we performed parallel controlled-rearing experiments on newborn chicks and deep neural networks (DNNs), including CNNs and transformers. First, we raised newborn chicks in impoverished environments containing a single object, then simulated those environments in a video game engine. Second, we recorded first-person images from agents moving through the virtual animal chambers and used those images to train DNNs. Third, we compared the viewpoint-invariant object recognition performance of the chicks and DNNs. When DNNs received the same visual diet (training data) as the chicks, the models developed the same object recognition skills as the chicks. DNNs that used time as a teaching signal (space-time fitters) also showed the same patterns of successes and failures across the test viewpoints as the chicks. Thus, DNNs can learn object recognition in the same impoverished environments as newborn animals. We argue that space-time fitters can serve as formal scientific models of newborn visual systems, providing image-computable models for studying how newborns learn to see from raw visual experiences.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
🤗Data | 📄Paper | 🚀Code | 💻Model | 🔥Citation
Dataset Details
Dataset type: The LISA++ CoT dataset is a QA dataset designed to train MLLMs for Visual CoT and reasoning segmentation, enhancing the model's global understanding ability. It is based on the COCO 2017 dataset. Where to send questions or comments about the dataset: https://github.com/dvlab-research/LISA. Paper:… See the full description on the dataset page: https://huggingface.co/datasets/Senqiao/LISA_Plus_COT.
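A quick way to peek at a few records without downloading the full dataset; streaming mode and the "train" split name are assumptions about how the repository is laid out.

```python
from itertools import islice

from datasets import load_dataset

# Stream the LISA++ CoT records so nothing is downloaded up front;
# field names are not documented on this page, so we only print keys.
stream = load_dataset(
    "Senqiao/LISA_Plus_COT",
    split="train",   # assumed split name
    streaming=True,
)

for record in islice(stream, 3):
    print(sorted(record.keys()))
```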
WeThink-Multimodal-Reasoning-120K
Image Type
Image data can be accessed from https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k (a download sketch follows the table below).
Image Type | Source Dataset | Images
General Images | COCO | 25,344
General Images | SAM-1B | 18,091
General Images | Visual Genome | 4,441
General Images | GQA | 3,251
General Images | PISC | 835
General Images | LLaVA | 134
Text-Intensive Images | TextVQA | 25,483
Text-Intensive Images | ShareTextVQA | 538
Text-Intensive Images | DocVQA | 4,709
Text-Intensive Images | OCR-VQA | 5,142
Text-Intensive Images | ChartQA | 21,781
Scientific & Technical | GeoQA+ | 4,813
Scientific & Technical | ScienceQA | 4,990
Scientific & Technical | AI2D | 1,812
Scientific & Technical | CLEVR-Math | 677
… See the full description on the dataset page: https://huggingface.co/datasets/WeThink/WeThink-Multimodal-Reasoning-120K.
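A hedged sketch for pairing the WeThink annotations with their images, which (per the note above) live in the LLaVA-CoT-100k repository; the record field that stores the image path, and whether the default datasets loader works for this repo, are assumptions.

```python
from datasets import load_dataset
from huggingface_hub import snapshot_download

# Annotation records (questions, CoT traces, answers) for WeThink.
annotations = load_dataset("WeThink/WeThink-Multimodal-Reasoning-120K")

# Image files are hosted separately in the LLaVA-CoT-100k dataset repo;
# mirror it locally so relative image paths in the records (whatever
# field they use) can be resolved against this directory.
image_root = snapshot_download(
    repo_id="Xkev/LLaVA-CoT-100k",
    repo_type="dataset",
    # allow_patterns=[...]  # optionally restrict this large download
)

split = next(iter(annotations))
print("record fields:", list(annotations[split][0].keys()))
print("images mirrored to:", image_root)
```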
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GameQA is a large-scale, diverse, and challenging multimodal reasoning dataset designed to enhance the general reasoning capabilities of Vision Language Models (VLMs). Generated using the innovative Code2Logic framework, it leverages game code to synthesize high-quality visual-language Chain-of-Thought (CoT) data. The dataset addresses the scarcity of multimodal reasoning data, critical for advancing complex multi-step reasoning in VLMs. Each sample includes visual game… See the full description on the dataset page: https://huggingface.co/datasets/Code2Logic/GameQA-140K.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
ViC-Bench
About ViC-Bench
Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much as a human would, and has shown impressive success on a variety of tasks, driving advances in related benchmarks. Despite this promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly… See the full description on the dataset page: https://huggingface.co/datasets/meituan-longcat/ViC-Bench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SSR-CoT and SSRBench: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Paper: https://arxiv.org/abs/2505.12448 | Project Page: https://yliu-cs.github.io/SSR/ | Code: https://github.com/yliu-cs/SSR
Abstract
Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth… See the full description on the dataset page: https://huggingface.co/datasets/yliu-cs/SSRBench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Athenea-VL: Advanced Multimodal Reasoning Dataset (Aquiles-ai/Athenea-VL)
Dataset Description
Athenea-VL is a comprehensive multimodal reasoning dataset designed for training vision-language models on complex scientific and analytical tasks. This dataset combines high-quality visual content with Chain-of-Thought (CoT) reasoning, making it ideal for developing models capable of step-by-step problem-solving across multiple domains.
Key Features
20,913… See the full description on the dataset page: https://huggingface.co/datasets/Aquiles-ai/Athenea-VL.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
UniVG-R1 Model Card
Model details
We propose UniVG-R1, a reasoning-guided MLLM for universal visual grounding, which leverages reinforcement learning to enhance reasoning across complex multi-image and multi-modal scenarios.
Dataset details
We provide three JSON files as follows:
revised_MIG_bench.json: contains our revised version of the MIG_bench.
stage1_cotsft.json: contains the CoT-SFT data required for stage 1.
stage2_rl.json: which… See the full description on the dataset page: https://huggingface.co/datasets/GD-ML/UniVG-R1-data.
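The three JSON files can be fetched individually with huggingface_hub; a minimal sketch, assuming the files sit at the repository root (the exact in-repo paths are not shown on this page).

```python
import json

from huggingface_hub import hf_hub_download

# Fetch one of the three JSON files listed above; swap the filename
# for revised_MIG_bench.json or stage2_rl.json as needed. If the file
# lives in a subdirectory of the repo, include that path in `filename`.
path = hf_hub_download(
    repo_id="GD-ML/UniVG-R1-data",
    filename="stage1_cotsft.json",   # assumed to be at the repo root
    repo_type="dataset",
)

with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

print(f"loaded {len(records)} top-level entries from {path}")
```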
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is the RL dataset for the paper: "ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL". ReasonGen-R1 is a two-stage framework that imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning (SFT) on a newly generated reasoning dataset of written rationales. It then refines its outputs using Group Relative Policy Optimization (GRPO). This dataset contains the model-crafted rationales paired with visual prompts… See the full description on the dataset page: https://huggingface.co/datasets/Franklin0/ReasonGen-R1-RL-Geneval-12k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🌟 ReVisual-R1 (7B) — Open-Source Multimodal Reasoner
One cold-start, two RL stages, endless reasoning power.
🔑 Highlights
SOTA on 9 tough benchmarks covering visual–math + text reasoning.
Three-Stage SRO Training
Text Cold-Start: seed deep reflection
Multimodal RL: align vision & logic
Text RL: polish fluency & brevity
PAD (Prioritized Advantage Distillation) keeps gradients alive.
Efficient-Length Reward = concise, self-reflective CoT.
📚… See the full description on the dataset page: https://huggingface.co/datasets/csfufu/Grammer_dataset.