Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of the spatial bias of LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled probing experiments, we observe that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a clear spatial bias in their semantic understanding. Further analysis indicates that this bias does not stem from the vision encoder, but rather from a mismatch in attention mechanisms between the vision encoder and the large language model, which disrupts the global information flow. Motivated by this insight, we propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that dynamically injects shared global visual context into each image token. AGCI works without architectural modifications, mitigating spatial bias by enhancing the semantic accessibility of image tokens while preserving the model’s intrinsic capabilities. Extensive experiments demonstrate that AGCI not only enhances the spatial robustness of LVLMs, but also achieves strong performance on various downstream tasks and hallucination benchmarks.


💡 Research Summary

The paper investigates a previously under‑explored weakness of large vision‑language models (LVLMs): spatial bias, i.e., the tendency of a model to produce different answers when the same visual content appears at different locations within an image. To quantify this phenomenon, the authors construct a probing dataset based on image‑text matching. They sample 10 000 image‑caption pairs from LAION, embed each key image together with eight distractor images into a 3 × 3 grid, and ask a binary “does any sub‑image match the caption?” question. By moving the key image to each of the nine grid positions, they generate 90 000 test instances while keeping all other visual and textual information constant.
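The grid construction described above can be sketched in a few lines. The helper below is a hypothetical illustration (the paper does not publish this exact code): it assumes square sub-images of equal size and places the key image at one of the nine cells, filling the rest with distractors in row-major order.

```python
import numpy as np

def build_grid_probe(key_img, distractors, position, cell=224):
    """Place key_img at `position` (0-8) in a 3x3 grid and fill the
    remaining eight cells with distractor images (row-major order)."""
    assert key_img.shape == (cell, cell, 3)
    assert len(distractors) == 8
    grid = np.zeros((3 * cell, 3 * cell, 3), dtype=key_img.dtype)
    imgs = list(distractors)
    imgs.insert(position, key_img)  # key image occupies the target cell
    for idx, img in enumerate(imgs):
        r, c = divmod(idx, 3)
        grid[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = img
    return grid

# One key image plus eight distractors yields nine probe variants,
# identical except for where the key image sits.
key = np.ones((224, 224, 3), dtype=np.uint8)
dis = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(8)]
probes = [build_grid_probe(key, dis, p) for p in range(9)]
```

Repeating this for every sampled image-caption pair is what turns 10 000 pairs into 90 000 test instances.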

Six representative LVLMs—Qwen2.5‑VL (7B, 32B, 72B), Qwen3‑VL‑8B‑Instruct, Gemma3n‑E4B‑it, LLaVA‑v1.6‑Mistral‑7B, and InternVL3‑8B—are evaluated in a zero‑shot setting. All models display noticeable performance fluctuations across positions; the most extreme case (LLaVA‑v1.6) shows up to a 15 % absolute drop in accuracy when the key image moves from the top‑left to the bottom‑right cell. Larger model variants tend to be more stable, but the bias never disappears completely.

To pinpoint the source of the bias, the authors decompose LVLM processing into two stages: (1) perception, where the vision encoder extracts low‑level features, and (2) semantic understanding, where the language model consumes the visual tokens for multimodal reasoning. Using eraser‑search (masking each image region and measuring logit changes), they demonstrate that the perception stage is robust: the model consistently identifies the critical region regardless of its location. Next, they compute cosine similarity between vision‑encoder outputs and the corresponding caption embeddings for 1 000 samples placed at different grid cells. The similarity remains essentially unchanged, indicating that the vision encoder’s semantic encoding is also position‑invariant.
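The second check above, comparing encoder outputs against caption embeddings, amounts to a cosine-similarity test. The snippet below is a toy stand-in, not the authors' pipeline: the "encoder outputs" are synthetic vectors (caption embedding plus small noise) representing the same key image placed at two different grid cells.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for pooled vision-encoder outputs of the same
# key image at two grid cells, plus the caption embedding they are
# compared against.
rng = np.random.default_rng(0)
caption_emb = rng.standard_normal(512)
enc_pos_a = caption_emb + 0.05 * rng.standard_normal(512)  # key at cell A
enc_pos_b = caption_emb + 0.05 * rng.standard_normal(512)  # key at cell B

sim_a = cosine_similarity(enc_pos_a, caption_emb)
sim_b = cosine_similarity(enc_pos_b, caption_emb)
# Position invariance of the encoder shows up as sim_a ~= sim_b.
```

In the paper's real measurement, near-identical similarities across the 1 000 samples are what license the conclusion that the encoder's semantic encoding is position-invariant.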

These findings rule out the vision encoder as the culprit. The authors argue that the mismatch between the bidirectional self‑attention used in ViT‑style vision encoders and the causal (autoregressive) attention employed by large language models (LLMs) creates a “global information flow break.” In the vision encoder, every image token can attend to all others, forming a shared global context. When these tokens are fed into a causal LLM, earlier tokens cannot attend to later ones, so the global context is not uniformly available during cross‑modal reasoning. Consequently, the contribution of each image token becomes dependent on its sequential position, manifesting as spatial bias.
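The attention mismatch can be made concrete with the two mask shapes involved. This minimal sketch contrasts the bidirectional reach of a ViT-style encoder with the lower-triangular reach of a causal LLM for nine image tokens (one per grid cell):

```python
import numpy as np

def attention_reach(n_tokens, causal):
    """Boolean matrix where entry (i, j) is True iff token i can attend
    to token j. Bidirectional attention is all-ones; causal attention
    is lower-triangular."""
    if causal:
        return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    return np.ones((n_tokens, n_tokens), dtype=bool)

n = 9  # e.g. one token per cell of the 3x3 grid
bidir = attention_reach(n, causal=False)
caus = attention_reach(n, causal=True)

# Encoder: every token sees all nine. LLM: the first image token sees
# only itself, the last sees all nine -- access to global context
# depends on sequential position.
```

The row sums make the asymmetry explicit: under the causal mask, token 0 reaches 1 token while token 8 reaches all 9, which is exactly the position-dependent access to global context the paper identifies.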

To mitigate this, they propose Adaptive Global Context Injection (AGCI), a lightweight, training‑free module that injects a shared global visual token into each image token based on semantic similarity. The procedure is: (1) summarize the entire set of visual tokens into a single global vector V_g (e.g., via average pooling); (2) compute cosine similarity s_i between each token v_i and V_g; (3) augment tokens with low similarity by adding a weighted copy of V_g: v’_i = v_i + α·(1‑s_i)·V_g, where α is a small scaling factor. Tokens already well‑aligned with the global context receive little or no modification, preserving useful information while strengthening under‑represented tokens. Importantly, AGCI does not alter the underlying architecture; it can be applied at inference time to any pretrained LVLM.
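The three steps above translate directly into code. The sketch below follows the stated formula v'_i = v_i + α·(1−s_i)·V_g with mean pooling for V_g; exact pooling choice and α value are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def agci(tokens, alpha=0.1):
    """Adaptive Global Context Injection (sketch).
    tokens: (n_tokens, dim) array of visual tokens."""
    v_g = tokens.mean(axis=0)                          # 1) global vector V_g
    norms = np.linalg.norm(tokens, axis=1) * np.linalg.norm(v_g)
    s = tokens @ v_g / np.maximum(norms, 1e-8)         # 2) cosine sim s_i
    return tokens + alpha * (1.0 - s)[:, None] * v_g   # 3) weighted injection

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))  # toy sizes: 16 tokens, 64-dim
out = agci(tokens)
```

Because the injected amount scales with (1−s_i), tokens already aligned with the global context are left nearly untouched while poorly aligned tokens receive the largest correction, which is what makes the method safe to apply at inference time.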

Extensive experiments validate AGCI. On the probing dataset, position‑wise accuracy variance drops by more than 70 % across all models, and the worst‑case accuracy gap is reduced to near‑zero for Qwen2.5‑VL‑7B and LLaVA‑v1.6. Beyond probing, AGCI is evaluated on six downstream benchmarks covering general VQA, OCR‑oriented tasks, and hallucination mitigation. In all cases, performance is either maintained or modestly improved (e.g., +3–5 % on hallucination benchmarks, +1.2 % on OCR accuracy). Larger models (Qwen2.5‑VL‑72B) benefit most, showing almost complete removal of spatial bias while retaining their strong baseline scores.
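The two headline metrics in this evaluation, position-wise accuracy variance and worst-case accuracy gap, are simple to compute. The accuracy numbers below are made up for illustration and are not the paper's reported results:

```python
import numpy as np

def spatial_bias_stats(acc_by_position):
    """Summarize spatial bias from per-position accuracies over the
    nine grid cells: variance and worst-case (max - min) gap."""
    acc = np.asarray(acc_by_position, dtype=float)
    return {"variance": float(acc.var()),
            "worst_gap": float(acc.max() - acc.min())}

# Toy per-position accuracies before and after mitigation.
before = spatial_bias_stats([0.80, 0.78, 0.75, 0.72, 0.70,
                             0.68, 0.67, 0.66, 0.65])
after = spatial_bias_stats([0.78, 0.78, 0.77, 0.78, 0.77,
                            0.78, 0.78, 0.77, 0.78])
```

A >70 % variance reduction, as reported for AGCI, corresponds to `after["variance"]` falling below 30 % of `before["variance"]`.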

The paper’s contributions are threefold: (1) a systematic, large‑scale probing methodology that reveals spatial bias in LVLMs; (2) an analysis that attributes the bias to the causal attention mechanism of the language model rather than to the vision encoder; and (3) the AGCI mechanism, a simple yet effective remedy that restores global visual context without architectural changes and generalizes across tasks. By highlighting the importance of preserving global information flow when fusing vision and language, the work provides a clear design guideline for future multimodal systems and a practical tool for practitioners seeking more robust, trustworthy LVLM deployments.

