Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM’s cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
💡 Research Summary
The paper introduces AttWarp, a test‑time image preprocessing technique that leverages the cross‑modal attention maps of multimodal large language models (MLLMs) to dynamically reallocate pixel density across an input image. Instead of modifying the model’s weights or architecture, AttWarp extracts a 2‑D attention score matrix from selected decoder layers and heads, upsamples it to the original resolution, and then collapses it into horizontal and vertical marginal attention profiles. These profiles are normalized into cumulative distribution functions (CDFs); the inverse CDFs define coordinate mappings that expand high‑attention regions and compress low‑attention ones through a rectilinear warping operation. The warped image retains a regular grid, preserving all original visual information while emphasizing task‑relevant details such as small objects or subtle spatial relations.
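The marginal-profile-to-inverse-CDF pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it uses nearest-neighbor upsampling and nearest-pixel resampling for clarity, and the function name `rectilinear_warp` is our own.

```python
import numpy as np

def rectilinear_warp(image, attention):
    """Warp `image` so high-attention regions receive more output pixels.

    image: (H, W, C) array; attention: (h, w) non-negative attention map.
    Hypothetical sketch of the inverse-CDF warping idea described above.
    """
    H, W = image.shape[:2]
    # Upsample the attention map to image resolution (nearest neighbor).
    ys = np.arange(H) * attention.shape[0] // H
    xs = np.arange(W) * attention.shape[1] // W
    att = attention[np.ix_(ys, xs)].astype(np.float64)

    # Collapse into horizontal and vertical marginal attention profiles.
    col = att.sum(axis=0); col /= col.sum()
    row = att.sum(axis=1); row /= row.sum()

    # Normalize into cumulative distribution functions over x and y.
    cdf_x = np.cumsum(col)
    cdf_y = np.cumsum(row)

    # Inverse CDF: uniformly spaced output coordinates map back to
    # attention-weighted source coordinates, so regions where the CDF is
    # steep (high attention) are stretched across more output pixels.
    u = (np.arange(W) + 0.5) / W
    v = (np.arange(H) + 0.5) / H
    src_x = np.searchsorted(cdf_x, u).clip(0, W - 1)
    src_y = np.searchsorted(cdf_y, v).clip(0, H - 1)

    return image[np.ix_(src_y, src_x)]
```

Because the output stays on a regular grid, the warped image can be fed to any standard vision encoder unchanged, which is the compatibility property the summary highlights.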
Three variants are explored. The basic AttWarp performs a single warp based on the initial attention map. AttWarp‑Chain iteratively applies the warp: after each iteration the model processes the newly warped image, produces an updated attention map, and a new warp is computed. Iterations stop when the KL‑divergence between successive attention distributions falls below a threshold, allowing the method to adaptively determine the needed warping strength for each query. AttWarp‑Distill addresses inference latency by training a lightweight predictor that directly outputs the marginal attention profiles from an image‑text pair. The predictor uses CLIP‑ViT visual tokens conditioned on the query via FiLM, followed by 1‑D convolutional heads that produce softmax‑normalized marginals. At inference time, the predictor replaces the expensive attention‑extraction pipeline, achieving roughly three‑fold speed‑up with negligible accuracy loss.
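The AttWarp-Chain loop with its KL-divergence stopping rule can be outlined as follows. This is a sketch under stated assumptions: `get_attention` stands in for the MLLM's attention-extraction pass, `warp` for the rectilinear warping step, and the threshold and iteration cap are illustrative values, not the paper's.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (smoothed)."""
    p = p + eps; q = q + eps
    p = p / p.sum(); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def attwarp_chain(image, get_attention, warp, kl_threshold=0.05, max_iters=5):
    """Iteratively re-warp until the attention distribution stabilizes.

    get_attention(image) -> flattened attention distribution (hypothetical
    stand-in for extracting the MLLM's cross-modal attention);
    warp(image, attention) -> warped image.
    """
    prev = get_attention(image)
    for _ in range(max_iters):
        image = warp(image, prev)
        curr = get_attention(image)
        # Stop once successive attention maps barely change.
        if kl_divergence(curr, prev) < kl_threshold:
            break
        prev = curr
    return image
```

The per-query stopping criterion is what lets the chain apply strong warping only where the model's attention keeps shifting, rather than a fixed number of passes for every input.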
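The shape of the AttWarp-Distill predictor can be illustrated with a toy forward pass. This is a simplification of the architecture described above: FiLM conditioning is shown as a per-channel scale and shift predicted from the query embedding, and a plain linear projection stands in for the paper's 1-D convolutional heads; all parameter names and shapes here are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_marginals(visual_tokens, text_embedding, params):
    """Toy sketch of a distilled marginal-attention predictor.

    visual_tokens: (h, w, D) grid of visual features (e.g. CLIP-ViT tokens);
    text_embedding: (T,) query embedding; `params` holds illustrative
    projection matrices, not the paper's trained weights.
    """
    # FiLM: the query predicts a per-channel scale (gamma) and shift (beta).
    gamma = text_embedding @ params["W_gamma"]      # (D,)
    beta = text_embedding @ params["W_beta"]        # (D,)
    tokens = gamma * visual_tokens + beta           # broadcast over the grid

    # Pool along each spatial axis, project to scalar scores per row/column.
    row_feat = tokens.mean(axis=1)                  # (h, D)
    col_feat = tokens.mean(axis=0)                  # (w, D)
    row_scores = row_feat @ params["w_out"]         # (h,)
    col_scores = col_feat @ params["w_out"]         # (w,)

    # Softmax-normalized vertical and horizontal marginals.
    return softmax(row_scores), softmax(col_scores)
```

At inference, such a predictor replaces the full attention-extraction pass through the MLLM, which is where the reported roughly three-fold speed-up comes from.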
Extensive experiments cover five benchmarks—TextVQA, GQA, DocVQA, POPE, and MMMU—and four popular MLLM backbones: LLaVA, Qwen‑VL, InternVL, and InstructBLIP. Across all settings, AttWarp consistently improves accuracy (typically 3–5 percentage points on VQA tasks, and more than 4 points on DocVQA), enhances compositional reasoning, reduces hallucinations as measured by POPE, and improves multi‑discipline understanding on MMMU. Compared to four strong baselines that manipulate raw images at test time (simple cropping, uniform resizing, seam‑carving, and saliency‑based warps), AttWarp’s attention‑guided approach yields superior performance while preserving global layout.
Technical analysis shows that the method’s success hinges on the quality of the underlying attention maps; erroneous attention can misguide the warp and hurt performance. The rectilinear design ensures compatibility with standard vision encoders because the warped image remains on a regular pixel grid, avoiding the need for mesh‑based resampling. The authors also provide a rigorous ablation study confirming that (1) using marginal profiles rather than full 2‑D maps reduces computational cost without sacrificing fidelity, (2) the inverse‑CDF warping is essential for preserving distributional properties, and (3) the chain termination criterion reliably detects convergence.
Limitations include dependence on the reliability of the MLLM’s attention, increased memory usage for very high‑resolution inputs, and the current restriction to rectilinear (axis‑aligned) warps. Future work may explore integrating more sophisticated, non‑linear mesh deformations, improving attention extraction via auxiliary supervision, and extending the approach to video or embodied‑agent settings where real‑time perception is critical.
In summary, AttWarp offers a lightweight, plug‑and‑play solution that improves fine‑grained visual grounding of existing multimodal LLMs by reshaping the input image itself according to query‑specific attention, achieving consistent gains without any model retraining.