Same Answer, Different Representations: Hidden instability in VLMs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.


💡 Research Summary

This paper challenges the prevailing assumption that output‑level stability is sufficient to judge the robustness of Vision‑Language Models (VLMs). The authors argue that VLMs can keep the same answer even when their internal multimodal representations shift dramatically—a phenomenon they call “representation drift.” To expose this hidden instability, they introduce a comprehensive evaluation framework that goes beyond label accuracy and measures four internal‑level diagnostics: (1) Embedding Stability, which computes cosine distance and L2 norm between base and perturbed embeddings at five strategic positions (context vs. answer, open‑ended vs. multiple‑choice prompts); (2) Structural Smoothness, quantified by Dirichlet Energy, capturing how much adjacent vision tokens diverge after a perturbation; (3) Perturbation‑vs‑Control Drift, expressed as Cohen’s d between perturbation‑induced drift and the natural inter‑image variability; and (4) Drift‑to‑Prior, evaluated on the POPE hallucination benchmark to see whether perturbations push the model toward its language‑only prior.
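The first three diagnostics are straightforward to compute once embeddings and vision tokens have been extracted. A minimal sketch (function names, array shapes, and the token-grid layout are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def cosine_drift(base, pert):
    """Cosine distance (1 - cosine similarity) between base and perturbed embeddings."""
    b = base / np.linalg.norm(base)
    p = pert / np.linalg.norm(pert)
    return 1.0 - float(b @ p)

def l2_drift(base, pert):
    """L2 norm of the embedding displacement."""
    return float(np.linalg.norm(base - pert))

def dirichlet_energy(tokens, h, w):
    """Sum of squared differences between spatially adjacent vision tokens.

    tokens: (h*w, d) array of vision-token embeddings laid out on an h x w grid.
    Higher energy means a less spatially smooth token field.
    """
    grid = tokens.reshape(h, w, -1)
    dx = grid[:, 1:, :] - grid[:, :-1, :]   # horizontal neighbor differences
    dy = grid[1:, :, :] - grid[:-1, :, :]   # vertical neighbor differences
    return float((dx ** 2).sum() + (dy ** 2).sum())

def cohens_d(perturb_drift, control_drift):
    """Effect size of perturbation-induced drift vs. natural inter-image drift."""
    a = np.asarray(perturb_drift, dtype=float)
    b = np.asarray(control_drift, dtype=float)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return float((a.mean() - b.mean()) / pooled)
```

A Cohen's d near 1 would mean the perturbation moves embeddings about as far as swapping in a different image entirely, which is the comparison the paper's "drift approaches inter-image variability" finding rests on.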

The authors apply this framework to several state‑of‑the‑art VLM families (Qwen‑3‑VL and LLaVA) across three benchmark suites: SEED‑Bench (visual reasoning), MMMU (multi‑image reasoning), and POPE (object‑existence hallucination). They test six families of natural, meaning‑preserving transformations: translation, padding, cropping, scaling, rotation, and text overlays (the last in three variants: semantic adversarial text, random strings, and empty boxes). Each perturbation is sampled over a range of parameters, and evaluation uses a log‑likelihood‑based multiple‑choice scoring protocol that yields confidence margins in addition to binary correctness.
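The scoring protocol can be sketched as follows. Here `loglik_fn` is a hypothetical stand-in for a VLM forward pass that returns the log-probability of an answer option given the prompt; the real pipeline would sum token log-probabilities from the model:

```python
def score_choices(loglik_fn, prompt, choices):
    """Rank answer options by log-likelihood under the model.

    loglik_fn(prompt, choice) -> total log-probability of the choice
    (hypothetical interface). Returns the predicted choice and the
    confidence margin between the top two options, in log-prob units.
    """
    scores = sorted(((loglik_fn(prompt, c), c) for c in choices), reverse=True)
    (top_score, best), (second_score, _) = scores[0], scores[1]
    margin = top_score - second_score  # shrinks as the model becomes less sure
    return best, margin
```

The margin is what lets the paper detect "sharper declines in confidence" even on instances where the top-1 answer never flips.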

Key findings are threefold. First, there is a substantial disconnect between output stability and internal consistency. While the overall instance flip rate (IFR) across all perturbations is about 37 % of images, many cases show unchanged answers despite embedding drifts that are comparable to the distance between unrelated images. Text overlays are especially disruptive, producing the highest IFR (≈19 %) and causing embedding drifts that approach inter‑image variability. Second, model scale does not mitigate this hidden instability. Larger models achieve higher base accuracies (e.g., 61 % → 71 % on SEED‑Bench) but exhibit equal or greater representation drift and sharper declines in confidence margins, indicating more fragile decision boundaries. Third, the impact of perturbations varies by task type. In reasoning tasks, perturbations that disturb the integration of coarse (low‑frequency) and fine (high‑frequency) visual cues lead to more random errors and reduced margins. Conversely, on hallucination tasks, the same perturbations make models more conservative, lowering false‑positive rates and shifting predictions toward the language‑only prior.
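The instance flip rate (IFR) quoted above is simply the fraction of instances whose predicted answer changes under a perturbation; a minimal sketch:

```python
def instance_flip_rate(base_preds, pert_preds):
    """Fraction of instances whose predicted answer flips under perturbation."""
    assert len(base_preds) == len(pert_preds), "prediction lists must align"
    flips = sum(b != p for b, p in zip(base_preds, pert_preds))
    return flips / len(base_preds)
```

The paper's central point is that the complement of this number (the ~63 % of unchanged answers) is not evidence of stable processing, since the embedding-drift diagnostics can be large even when IFR contributions are zero.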

Spectral analysis reveals that geometric transformations preserve overall frequency magnitude but scramble phase, which is crucial for spatial structure. This phase misalignment inflates Dirichlet Energy, confirming that perturbations inject high‑frequency structural noise. Text overlays add explicit high‑frequency components, further destabilizing token‑level coherence.
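The magnitude/phase asymmetry described here is a direct consequence of the Fourier shift theorem and is easy to verify numerically. This sketch (my illustration, not the authors' analysis code) uses a circular translation so that the magnitude invariance is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))
shifted = np.roll(img, shift=(5, 3), axis=(0, 1))  # circular translation

F_base = np.fft.fft2(img)
F_shift = np.fft.fft2(shifted)

# Magnitude spectrum is invariant under circular translation...
mag_gap = float(np.max(np.abs(np.abs(F_base) - np.abs(F_shift))))

# ...but phase is not: translation multiplies each coefficient by a
# frequency-dependent complex exponential, scrambling spatial structure.
phase_gap = float(np.max(np.abs(np.angle(F_base) - np.angle(F_shift))))
```

Because adjacent-token differences (Dirichlet Energy) are dominated by high-frequency content, phase scrambling of exactly this kind shows up as the structural-noise injection the paper reports.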

The authors conclude that robust VLM evaluation must incorporate representation‑level metrics; relying solely on output invariance can mask severe internal volatility that may surface under downstream or compounded perturbations. Their framework provides a diagnostic toolkit for researchers to pinpoint where instability arises—whether in visual‑context encoding, answer generation, or token‑level smoothness—and suggests that future model design should prioritize spectral stability, token coherence, and regularization strategies that align internal representations across benign transformations. This work thus opens a new direction for assessing and improving the true robustness of multimodal AI systems.

