Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

Recommendation Score

breakthrough, 🟡 Intermediate, Reasoning & Agents, LLM Reasoning, Benchmark, Best for researchers

Research context

Primary field

Reasoning & Agents

Reasoning, planning, tool use, and agentic workflows.

Topics

LLM Reasoning

Paper type

Benchmark

Best for

Researchers

arXiv categories

cs.CV, cs.AI

Why It Matters

Shows that CoT prompting harms visual spatial reasoning in multimodal LLMs, which calls for a rethink of reasoning paradigms in robotics, AR/VR, and vision-language systems where spatial accuracy is non-negotiable.

Abstract

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
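The comparison the abstract describes (the same questions answered with and without CoT, plus a run with the image withheld) can be illustrated with a small scoring sketch. This is not the authors' evaluation code: the record format, the condition labels (`direct`, `cot`, `cot_no_image`), the benchmark name `toy_spatial`, and both function names are assumptions made for illustration, and the records are synthetic placeholders rather than results from the paper. The sketch also does not attempt to reproduce the No-Image++ ablation, whose design is not given here.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {benchmark: {condition: accuracy}} from per-question records.

    Each record is a dict with hypothetical keys:
    'benchmark', 'condition', 'prediction', 'answer'.
    """
    # counts[benchmark][condition] = [number correct, number seen]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in records:
        cell = counts[r["benchmark"]][r["condition"]]
        cell[0] += int(r["prediction"] == r["answer"])
        cell[1] += 1
    return {
        bench: {cond: correct / seen for cond, (correct, seen) in conds.items()}
        for bench, conds in counts.items()
    }


def cot_delta(acc, baseline="direct", treatment="cot"):
    """Per-benchmark accuracy change when moving from baseline to treatment."""
    return {
        bench: conds[treatment] - conds[baseline]
        for bench, conds in acc.items()
        if baseline in conds and treatment in conds
    }


# Synthetic records that only exercise the code; NOT results from the paper.
records = [
    {"benchmark": "toy_spatial", "condition": "direct", "prediction": "left", "answer": "left"},
    {"benchmark": "toy_spatial", "condition": "direct", "prediction": "right", "answer": "right"},
    {"benchmark": "toy_spatial", "condition": "cot", "prediction": "left", "answer": "left"},
    {"benchmark": "toy_spatial", "condition": "cot", "prediction": "left", "answer": "right"},
    {"benchmark": "toy_spatial", "condition": "cot_no_image", "prediction": "left", "answer": "left"},
    {"benchmark": "toy_spatial", "condition": "cot_no_image", "prediction": "left", "answer": "right"},
]

acc = accuracy_by_condition(records)
delta = cot_delta(acc)
```

A negative entry in `delta` means the treatment condition scored lower than the baseline on that benchmark. Under the same assumptions, an image-withheld accuracy well above chance would be one sign that answers come from textual priors rather than from the image.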


Published April 17, 2026