ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.


💡 Research Summary

ViewFusion tackles a persistent weakness of current multimodal large language models (MLLMs): the inability to reliably align and reason over multiple visual viewpoints. Existing approaches typically treat each image as an independent evidence source and jump straight to answering, or they add a superficial “describe‑first” step that merely summarizes each view without establishing cross‑view spatial relationships. As a result, tasks that require understanding camera motion, object re‑identification across views, or occlusion changes suffer from brittle, shortcut‑driven behavior.

The proposed solution is a two‑stage “think‑twice” framework. In the first stage, the model generates a structured <spatial_thinking> trace that explicitly infers viewpoint relations, spatial transformations, and shared landmarks across all input images. This stage builds an intermediate workspace that encodes how the camera moved, which objects correspond, and how occlusions evolve—information that cannot be captured by simple per‑image captions. In the second stage, the model performs question‑driven reasoning and produces the final answer while conditioning on the workspace created in stage one. By forcing the model to align views before answering, the architecture eliminates the shortcut of solving the question with only a single view.
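To make the two-stage protocol concrete, here is a minimal sketch of how a structured response could be split into its pre-alignment workspace and the stage-two reasoning that follows it. The `<spatial_thinking>` tag comes from the paper; the parsing logic, sample text, and function name are illustrative assumptions, not the authors' implementation.

```python
import re

def parse_two_stage(output: str):
    """Split a model response into its spatial pre-thinking workspace
    (stage one) and the question-driven reasoning that follows (stage two).
    Illustrative sketch: only the <spatial_thinking> tag is documented."""
    m = re.search(r"<spatial_thinking>(.*?)</spatial_thinking>", output, re.S)
    workspace = m.group(1).strip() if m else None
    # Everything after the closing tag is treated as stage-two reasoning.
    reasoning = output.split("</spatial_thinking>", 1)[-1].strip()
    return workspace, reasoning

# Hypothetical two-stage response for illustration.
sample = (
    "<spatial_thinking>View 2 is ~90 degrees to the right of view 1; "
    "the red chair corresponds across both views.</spatial_thinking>\n"
    "Given that rotation, the lamp is left of the chair. Answer: B"
)
workspace, reasoning = parse_two_stage(sample)
```

A downstream evaluator can then score the two parts separately, which is exactly what the paper's two-signal reward setup requires.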

Training proceeds in two phases. First, a supervised fine‑tuning (SFT) dataset of 18 K synthetic multi‑view instances is constructed from VST‑500K and MindCube‑Trainset. Each instance is rewritten into a three‑part reasoning trace (<spatial_thinking>, question‑driven reasoning, and the final answer) using a strong LLM (Qwen‑32B‑Instruct) and filtered for strict format compliance. This teaches the model the desired two‑stage generation pattern. Second, a reinforcement‑learning (RL) phase uses 16 K additional instances and applies Group Relative Policy Optimization (GRPO). GRPO allows separate reward signals for (a) correct spatial pre‑thinking (e.g., accurate viewpoint transformation) and (b) correct final answer, ensuring that the model does not drift back to single‑view shortcuts during policy optimization.
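The RL phase described above can be sketched as a combination of a format reward, an answer reward, and GRPO's group-relative advantage normalization. The reward terms, weights, and answer format below are assumptions for illustration; the paper specifies only that separate signals reward the pre-thinking and the final answer.

```python
import re

def format_reward(response: str) -> float:
    """Reward strict two-stage format: exactly one <spatial_thinking>
    block followed by non-empty stage-two text. (Illustrative terms;
    the paper's exact reward design is not reproduced here.)"""
    blocks = re.findall(r"<spatial_thinking>.*?</spatial_thinking>", response, re.S)
    if len(blocks) != 1:
        return 0.0
    tail = response.split("</spatial_thinking>", 1)[-1].strip()
    return 1.0 if tail else 0.0

def answer_reward(response: str, gold: str) -> float:
    """Exact-match reward on a multiple-choice answer letter."""
    m = re.search(r"Answer:\s*([A-D])", response)
    return 1.0 if m and m.group(1) == gold else 0.0

def total_reward(response: str, gold: str, w_fmt=0.2, w_ans=0.8) -> float:
    """Weighted combination; the weights are hypothetical."""
    return w_fmt * format_reward(response) + w_ans * answer_reward(response, gold)

def grpo_advantages(rewards):
    """GRPO's critic-free baseline: normalize each sampled response's
    reward by the mean and std of its sampling group."""
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sd + 1e-6) for r in rewards]
```

Because advantages are computed relative to the group, responses that keep the two-stage format and answer correctly are pushed up while shortcut completions in the same group are pushed down, which is the stabilizing behavior the paper attributes to GRPO.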

Evaluation on MMSI‑Bench—a benchmark specifically designed to probe multi‑view spatial intelligence—shows that ViewFusion improves overall accuracy by 5.3 percentage points over the strong baseline Qwen3‑VL‑4B‑Instruct. The gains are especially pronounced on items that truly require cross‑view alignment, with improvements of 9–12 pp. Comparisons against Qwen3‑VL‑4B‑Thinking (which encourages longer deliberation but not explicit alignment) demonstrate that the explicit two‑stage protocol yields superior performance. Ablation studies confirm that (i) removing the <spatial_thinking> stage hurts accuracy, (ii) training with SFT alone yields modest gains, and (iii) RL without the structured trace can cause the model to revert to shortcut behavior.

In summary, ViewFusion introduces a clear “observe‑align‑reason” pipeline that forces MLLMs to build a coherent spatial model before answering. This design not only boosts quantitative performance on challenging multi‑view tasks but also produces more interpretable reasoning traces, making it easier to diagnose and correct alignment errors. Future work could extend the approach to real‑time robotics perception, augmented‑reality applications, or integrate explicit 3‑D scene graph representations for even richer spatial reasoning.

