Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering
LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.
💡 Research Summary
Large Vision‑Language Models (LVLMs) have achieved impressive multimodal understanding and generation capabilities, yet they remain prone to hallucinations—especially object hallucinations where the model describes entities absent from the input image. Existing mitigation strategies largely operate at the output level (e.g., instruction fine‑tuning, contrastive decoding, external expert models) and provide limited insight into the internal visual representations that give rise to these errors.
This paper adopts a representation‑level perspective by inserting a Sparse Autoencoder (SAE) into the visual encoder of a state‑of‑the‑art LVLM (LLaVA‑1.5). The SAE (Matryoshka variant) encodes dense visual features into a high‑dimensional latent space (≈65 k dimensions) while enforcing a Top‑K sparsity constraint, so that only the K most active neurons are retained for each image. Visualization of the top‑activated neurons shows that each neuron corresponds to a semantically meaningful visual concept (e.g., “bow tie”, “cat”, “grass”).
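The Top‑K sparsity described above can be sketched in a few lines. This is an illustrative toy (NumPy, made-up dimensions; the paper's Matryoshka SAE uses a latent space of roughly 65k dimensions and a learned decoder as well), not the authors' implementation:

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k=20):
    """Encode a dense visual feature into a sparse latent code.

    x:     dense visual embedding, shape (d_model,)
    W_enc: encoder weights, shape (d_model, d_latent)  -- illustrative
    b_enc: encoder bias, shape (d_latent,)
    k:     number of latent neurons kept active (Top-K sparsity)
    """
    pre = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    z = np.zeros_like(pre)
    top_idx = np.argsort(pre)[-k:]            # indices of the K largest activations
    z[top_idx] = pre[top_idx]                 # zero out everything else
    return z, top_idx

# Toy dimensions for illustration only (the paper uses d_latent ~ 65k).
rng = np.random.default_rng(0)
d_model, d_latent = 64, 1024
x = rng.standard_normal(d_model)
W = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
z, idx = topk_sae_encode(x, W, np.zeros(d_latent), k=20)
print((z != 0).sum())  # at most 20 neurons remain active
```

Each surviving index in `idx` is one of the interpretable neurons the paper visualizes (e.g., a "cat" or "grass" neuron).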
Through systematic analysis, the authors identify two distinct neuron families:
- Always‑on neurons – a tiny set (≈10 out of 65 k) that appear in the top‑20 for virtually every image. These neurons encode low‑level attributes such as color or texture and have little semantic impact on downstream tasks.
- Image‑specific neurons – thousands of neurons that fire selectively for particular objects or scene elements. Their activations are highly localized and directly influence the language generation component.
To probe the link between these neurons and hallucinations, the authors progressively add Gaussian noise to images and track changes in neuron activations using a Top‑K change ratio ΔK. As noise intensity grows, hallucination rates on the POPE benchmark rise sharply, and ΔK for image‑specific neurons increases dramatically, whereas always‑on neurons remain stable. This demonstrates that hallucinations are primarily driven by disruption of image‑specific neurons. Qualitative examples illustrate that noise‑induced weakening of relevant neurons (e.g., “camera”) leads the model to produce vague or incorrect descriptions.
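One natural reading of the Top‑K change ratio is the fraction of a clean image's Top‑K neurons that are displaced under noise. The sketch below uses that definition; the paper's exact formula may differ:

```python
import numpy as np

def topk_change_ratio(act_clean, act_noisy, k=20):
    """Fraction of the clean Top-K set displaced by noise.

    Assumed definition: dK = 1 - |TopK(clean) & TopK(noisy)| / K.
    """
    top_clean = set(np.argsort(act_clean)[-k:])
    top_noisy = set(np.argsort(act_noisy)[-k:])
    return 1.0 - len(top_clean & top_noisy) / k

rng = np.random.default_rng(0)
act = rng.random(1000)
# Mild vs. strong Gaussian perturbation of the activations:
dk_mild = topk_change_ratio(act, act + 0.01 * rng.standard_normal(1000))
dk_strong = topk_change_ratio(act, act + 1.0 * rng.standard_normal(1000))
print(dk_mild, dk_strong)
```

The paper's observation is that for image‑specific neurons this ratio climbs sharply with noise intensity, tracking the rise in POPE hallucination rates, while always‑on neurons keep ΔK near zero.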
Leveraging this insight, the paper introduces Contrastive Neuron Steering (CNS). CNS processes a clean image and its noisy counterpart simultaneously, computes a contrastive signal to identify neurons whose activations diverge between the two versions, and then:
- Amplifies the identified image‑specific neurons (boosting their weights).
- Suppresses neurons that become spuriously active only under noise.
Additionally, an Always‑on Neuron Suppression (ANS) module down‑weights the non‑informative always‑on neurons, sharpening the model’s focus on semantically grounded features. CNS operates at the prefilling stage—before the language model begins decoding—requiring only one extra forward pass through the visual encoder, making it computationally lightweight and fully compatible with any existing decoding‑stage hallucination mitigation technique.
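The steering rule described above can be sketched as a per-neuron reweighting of the sparse latent code. Everything here is illustrative: the coefficients `alpha`, `beta`, `gamma` and the exact masks are assumptions standing in for the paper's actual CNS/ANS formulation:

```python
import numpy as np

def contrastive_neuron_steering(z_clean, z_noisy, always_on_idx,
                                alpha=1.5, beta=0.0, gamma=0.1):
    """Illustrative CNS sketch over SAE latent codes.

    z_clean / z_noisy: sparse latents of the clean image and its noisy copy
    always_on_idx:     indices of the ~10 always-on neurons (ANS target)
    alpha > 1 amplifies image-specific neurons; beta < 1 suppresses
    noise-induced ones; gamma < 1 down-weights always-on neurons.
    Coefficient values are placeholders, not the paper's.
    """
    z = z_clean.copy()
    informative = z_clean > z_noisy   # weakened under noise -> image-specific
    spurious = z_noisy > z_clean      # boosted by noise -> perturbation-induced
    z[informative] *= alpha           # amplify grounded neurons
    z[spurious] *= beta               # suppress spurious activations
    z[always_on_idx] *= gamma         # ANS: down-weight always-on neurons
    return z

z_clean = np.array([1.0, 0.0, 2.0, 0.5])
z_noisy = np.array([0.2, 1.0, 2.0, 0.5])
print(contrastive_neuron_steering(z_clean, z_noisy, always_on_idx=[3]))
```

Because this reweighting happens once on the visual representation before prefilling, it adds only the single extra encoder pass for the noisy counterpart, which is why CNS composes freely with decoding-stage methods.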
Experimental Evaluation
The authors evaluate CNS on both hallucination‑focused benchmarks (POPE with COCO images) and standard multimodal tasks (VQAv2, COCO‑Caption, Flickr30K). Key findings include:
- On POPE, CNS reduces hallucination rates by an average of 27 % while preserving accuracy and F1 scores (≤ 0.5 % drop).
- On general benchmarks, CNS yields modest gains in image‑text alignment metrics and caption quality, confirming that strengthening image‑specific neurons does not harm overall understanding.
- Ablation studies show that removing ANS degrades performance, highlighting the importance of suppressing always‑on signals. Varying the Top‑K sparsity and the number of contrastive samples influences the trade‑off between robustness and fidelity, underscoring the need for careful hyper‑parameter tuning.
- When combined with existing contrastive decoding or external expert‑model methods, CNS provides additive improvements, demonstrating its complementary nature.
Discussion and Limitations
CNS relies on the presence of a well‑trained SAE; models with different visual backbones may require retraining the autoencoder. The current work focuses on static images; extending the approach to video, audio, or other modalities remains an open direction. Moreover, while CNS mitigates hallucinations effectively, it does not entirely eliminate them, suggesting that further research into deeper architectural changes may be beneficial.
Conclusion
By decomposing LVLM visual embeddings into sparse, interpretable neurons, the paper uncovers that hallucinations stem chiefly from instability in image‑specific neurons. The proposed Contrastive Neuron Steering method offers a principled, representation‑level intervention that robustly enhances visual grounding and curtails hallucinations without sacrificing overall multimodal performance. This work advances both the interpretability and safety of large vision‑language systems, paving the way for more trustworthy AI applications in safety‑critical domains.