Test-Time Attention Purification for Backdoored Large Vision Language Models

Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.


💡 Research Summary

This paper addresses the vulnerability of large vision‑language models (LVLMs) to backdoor attacks that are introduced during fine‑tuning. Existing defenses typically require retraining of the adapter or LoRA modules with clean data, which is computationally costly and often harms downstream performance. The authors uncover a new mechanistic insight: backdoor activation in LVLMs does not rely on low‑level visual patterns but on an abnormal redistribution of cross‑modal attention, a phenomenon they call “attention stealing.” When a trigger‑embedded image is processed, visual tokens that contain the trigger attract disproportionately high attention from the language model, thereby stealing attention from the textual prompt and steering the model toward the attacker‑specified output.

To exploit this insight, the authors propose CleanSight, a training‑free, plug‑and‑play defense that operates entirely at test time. CleanSight first detects poisoned inputs by measuring, for each attention head in a set of middle cross‑modal fusion layers, the ratio of attention paid to visual tokens versus prompt tokens (the visual‑to‑text attention ratio). These head‑level ratios are concatenated into a feature vector and compared against a clean reference distribution estimated from a small clean validation set. A whitened ℓ₂ distance is used to compute an anomaly score; inputs whose score exceeds a high quantile threshold are flagged as poisoned.
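The detection step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the attention-tensor layout, the epsilon constants, and the helper names (`head_ratio_features`, `fit_reference`, `anomaly_score`) are all assumptions made for clarity.

```python
import numpy as np

def head_ratio_features(attn, visual_idx, text_idx):
    """Per-head visual-to-text attention ratio features.

    attn: (num_heads, seq_len, seq_len) attention weights from one
          cross-modal fusion layer (hypothetical layout).
    visual_idx / text_idx: positions of visual and prompt tokens.
    Returns one ratio per head; ratios from the selected layers would
    be concatenated into a single feature vector.
    """
    vis_mass = attn[:, :, visual_idx].sum(axis=(1, 2))  # attention paid to visual tokens
    txt_mass = attn[:, :, text_idx].sum(axis=(1, 2))    # attention paid to prompt tokens
    return vis_mass / (txt_mass + 1e-8)

def fit_reference(clean_feats):
    """Estimate the clean reference distribution from a small validation set."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + 1e-6 * np.eye(clean_feats.shape[1])
    w = np.linalg.inv(np.linalg.cholesky(cov))  # whitening transform
    return mu, w

def anomaly_score(feat, mu, w):
    """Whitened l2 distance of a ratio vector to the clean reference."""
    return float(np.linalg.norm(w @ (feat - mu)))
```

At deployment, the threshold would be set to a high quantile (e.g., the 99th percentile) of anomaly scores computed on the clean validation set, and any input scoring above it is flagged as poisoned.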

Once an input is flagged, CleanSight aggregates the visual tokens that receive abnormally high attention across heads and constructs a global mask. This mask is applied in subsequent layers and during decoding, effectively pruning the suspicious tokens while leaving the rest of the visual representation intact. Because the method does not modify model parameters, it is completely non‑intrusive and can be deployed as a runtime wrapper.
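The purification step above can likewise be sketched in a few lines. Again this is an illustrative assumption, not the authors' code: the attention layout, the `top_frac` pruning budget, and the function name `build_trigger_mask` are invented for the example.

```python
import numpy as np

def build_trigger_mask(attn, visual_idx, top_frac=0.05):
    """Flag visual tokens that receive abnormally high attention.

    attn: (num_heads, seq_len, seq_len) attention weights from the
          flagged fusion layers (hypothetical layout).
    top_frac: illustrative pruning budget, not a value from the paper.
    Returns a boolean keep-mask over all seq_len token positions.
    """
    # Attention each visual token receives, aggregated over heads and queries.
    received = attn[:, :, visual_idx].sum(axis=(0, 1))       # (num_visual,)
    k = max(1, int(np.ceil(top_frac * len(visual_idx))))
    suspicious = np.asarray(visual_idx)[np.argsort(received)[-k:]]
    keep = np.ones(attn.shape[-1], dtype=bool)
    keep[suspicious] = False  # prune suspected trigger tokens only
    return keep
```

The resulting keep-mask would then be applied to the visual token sequence in subsequent layers and during decoding, so the suspected trigger tokens are excluded while the rest of the visual representation is untouched.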

Extensive experiments on LLaVA and CLIP‑based LVLMs, using BadNet, blended, and global‑trigger attacks, demonstrate that CleanSight reduces attack success rates by more than 70 % while incurring less than a 2 % drop in clean accuracy. Compared with pixel‑level purification methods (e.g., transformation‑based defenses), CleanSight achieves 30–50 % higher mitigation effectiveness under the same attack strength. The paper also shows that head‑level attention ratios provide stronger discriminative power than averaged ratios, and that the middle layers (where cross‑modal fusion occurs) are the most informative for detection.

In addition to its defensive capability, CleanSight leverages insights from visual token pruning literature but repurposes pruning for security: instead of removing low‑attention tokens for efficiency, it removes high‑attention “trigger” tokens to break the backdoor’s attention hijacking. This novel use of attention manipulation establishes a practical, low‑overhead solution for real‑time LVLM services, where retraining is often infeasible. The authors conclude by suggesting future work on broader multimodal architectures, more sophisticated trigger designs, and integration with dynamic adapter‑level defenses.

