AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Current vision-specific watermarks, meanwhile, rely on a static, one-time estimation of vision-critical weights and ignore the density of the weight distribution when deciding what proportion of tokens to protect. This design fails to track dynamic changes in visual dependence during generation and can admit low-quality tokens from the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence, combining attention weights over visual tokens with context-aware coherence cues to obtain a more adaptive and better-calibrated evidence-weight distribution. It then sets the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), enabling adaptive vocabulary partitioning that avoids irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, measurably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, establishing a new standard for reliability-preserving multimodal watermarking.
💡 Research Summary
The paper addresses the problem of watermarking large vision‑language models (LVLMs) in a way that preserves visual fidelity while remaining detectable and robust. Existing watermarking techniques, originally designed for text‑only large language models, either ignore visual grounding (vision‑agnostic methods) or rely on a static, one‑time estimation of vision‑critical weights (vision‑specific methods). The former injects random biases that can produce visually irrelevant tokens, while the latter fails to adapt to the changing visual dependence during autoregressive generation and can introduce low‑quality tokens in the long tail of the vocabulary.
AGMark (Attention‑Guided Dynamic Watermarking) is proposed as a two‑stage framework that dynamically aligns watermark injection with visual semantics at each decoding step. The first stage, Semantic Critical Weight Extracting, computes a vision‑critical weight ψᵥₜ(k) for each token k from the attention distribution over visual tokens (Aᵥₜ) and the projected visual embeddings Eᵥ, using a cosine‑normalized dot product. Simultaneously, a context‑critical weight ψᶜₜ(k) is derived from the dot product between the current hidden state Eₕ and the token embedding. Both scores are standardized (z‑scored) and combined via a convex combination controlled by ω, yielding a dynamic semantic‑critical weight ψₜ(k). After min‑max normalization, tokens are sorted in descending order to form a prioritized list V*ₜ; the top portion constitutes the "semantic‑critical tokens" for the current step.
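The first stage can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the variable names, tensor shapes, and the exact normalization details (e.g. where the cosine normalization and ε terms sit) are assumptions made for readability.

```python
import numpy as np

def semantic_critical_weights(attn_visual, E_v, E_h, token_emb, omega=0.5):
    """Illustrative sketch of AGMark's Stage 1 (Semantic Critical Weight Extracting).

    Assumed shapes (not specified in the summary):
      attn_visual: (n_vis,)   attention over visual tokens at this step
      E_v:         (n_vis, d) projected visual embeddings
      E_h:         (d,)       current hidden state
      token_emb:   (|V|, d)   vocabulary embedding matrix
    """
    # Vision-critical score psi_v: attention-weighted visual context,
    # compared to each token embedding via cosine similarity.
    vis_ctx = attn_visual @ E_v                                   # (d,)
    vis_ctx = vis_ctx / (np.linalg.norm(vis_ctx) + 1e-8)
    tok_unit = token_emb / (np.linalg.norm(token_emb, axis=1, keepdims=True) + 1e-8)
    psi_v = tok_unit @ vis_ctx                                    # (|V|,)

    # Context-critical score psi_c: dot product with the current hidden state.
    psi_c = token_emb @ E_h                                       # (|V|,)

    # Standardize (z-score) each score, then combine convexly with weight omega.
    z = lambda x: (x - x.mean()) / (x.std() + 1e-8)
    psi = omega * z(psi_v) + (1.0 - omega) * z(psi_c)

    # Min-max normalize and sort descending into the prioritized list V*_t.
    psi = (psi - psi.min()) / (psi.max() - psi.min() + 1e-8)
    order = np.argsort(-psi)          # token ids, most semantic-critical first
    return psi, order
```

The convex weight ω trades off visual grounding against textual coherence; its value in the paper is not given here, so 0.5 is purely a placeholder.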
The second stage, Adaptive Vocabulary Partitioning, determines how many of these critical tokens should be protected. It measures token uncertainty via the normalized entropy H_normₜ of the next‑token probability distribution, and gauges the density of the semantic‑critical weights by finding the smallest subset Sₜ whose cumulative importance exceeds a threshold τ, defining ρₜ = |Sₜ|/|V|. The proportion of tokens to protect is then ηₜ = α·ρₜ·(1−H_normₜ). Based on ηₜ, a set Cₜ of top‑ranked tokens is selected, and a swap operation moves tokens between the red and green lists so that high‑importance tokens receive the positive logit bias δ used in logits‑based watermarking. Optional margin thresholds and per‑step caps prevent oscillations.
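The partitioning step above can be sketched in the same illustrative style. Again this is an assumption-laden reading of the summary, not the paper's code: the values of α and τ are placeholders, and the red/green swap is reduced to forcing the protected set Cₜ into the green list before adding the standard logit bias δ.

```python
import numpy as np

def adaptive_partition(psi, probs, alpha=0.5, tau=0.9):
    """Illustrative sketch of Stage 2 (Adaptive Vocabulary Partitioning).

    psi:   (|V|,) semantic-critical weights from Stage 1
    probs: (|V|,) unnormalized next-token probabilities
    """
    V = len(psi)
    # Uncertainty: normalized entropy H_norm of the next-token distribution.
    p = probs / probs.sum()
    H = -(p * np.log(p + 1e-12)).sum()
    H_norm = H / np.log(V)

    # Evidence density: smallest top-weight subset S_t whose cumulative
    # (normalized) importance exceeds tau; rho_t = |S_t| / |V|.
    w = np.sort(psi)[::-1]
    w = w / (w.sum() + 1e-12)
    size_S = int(np.searchsorted(np.cumsum(w), tau)) + 1
    rho = size_S / V

    # Proportion of tokens to protect: eta_t = alpha * rho_t * (1 - H_norm_t).
    eta = alpha * rho * (1.0 - H_norm)
    n_protect = max(1, int(eta * V))
    C = np.argsort(-psi)[:n_protect]     # top-ranked semantic-critical tokens
    return C, eta

def apply_watermark_bias(logits, green_mask, C, delta=2.0):
    """Force the protected tokens C into the green list (the 'swap'), then
    add the logits-based watermark bias delta to all green tokens."""
    green = green_mask.copy()
    green[C] = True
    return logits + delta * green, green
```

Note how ηₜ behaves at the extremes: under a near-uniform next-token distribution (H_norm ≈ 1) almost nothing is protected, while under a confident, peaked distribution the protected fraction grows toward α·ρₜ.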
Experiments were conducted on three mainstream 8‑billion‑parameter LVLMs (LLaVA‑Next‑Llama3, Qwen3‑VL, InternVL‑3.5) using the AMBER and MS‑COCO benchmarks. Evaluation covered detection performance (AUC, accuracy), visual consistency (CHAIR), text quality (perplexity, BLEU, BERTScore, STS), and robustness against five attacks (word insertion, deletion, synonym substitution, paraphrasing, translation). AGMark achieved at least 99.36% AUC for detection and 88.61% AUC under attacks, and improved CHAIR scores by an average of 1.7% over baselines. Text-quality metrics also showed gains, and inference latency increased by only 2–3%, confirming practical feasibility.
In summary, AGMark introduces a principled way to fuse visual attention signals with textual context, dynamically calibrates the amount of watermarking based on uncertainty and evidence density, and thereby resolves the preservation‑detection trade‑off that has limited prior LVLM watermarking approaches. The method sets a new benchmark for reliable, high‑fidelity watermarking of multimodal generative models.