Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose VARE, a novel VAR Erasure framework that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left open by earlier methods in autoregressive text-to-image generation.
💡 Research Summary
The paper addresses a pressing safety issue in the rapidly emerging class of visual autoregressive (VAR) models for text‑to‑image generation. While concept erasure (CE) techniques have been extensively studied for diffusion‑based generators, those methods rely on aligning predicted noise across timesteps and cannot be directly transferred to VAR architectures, where the model predicts discrete visual tokens at progressively larger scales in a coarse‑to‑fine autoregressive manner. Directly applying diffusion‑style CE to VAR leads to cumulative errors across scales, causing severe degradation of image quality.
To bridge this methodological gap, the authors propose a two‑stage solution. First, they introduce the VAR Erasure framework (VARE), which augments the input of the visual transformer with auxiliary visual tokens generated by the original (pre‑fine‑tuned) model for both the unsafe prompt (c*) and a neutral prompt (c). By conditioning the fine‑tuned model on these reference tokens (denoted r_ori* and r_ori), the optimization problem is confined to adjusting only the cross‑attention responses that are specific to the unsafe concept, while preserving the overall token sequence generated at coarser scales. This dramatically reduces the search space and prevents the cascade of errors that would otherwise arise when the model tries to re‑generate the entire token hierarchy from scratch.
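The VARE idea can be sketched as a training loop in which a frozen copy of the original model supplies the reference token prefixes r_ori* and r_ori at each scale, so the fine‑tuned model only has to adjust its next‑scale prediction rather than regenerate the whole hierarchy. The sketch below is purely illustrative under stated assumptions: the callables, the `vare_training_step` name, and the mean‑squared placeholder distance are all hypothetical stand‑ins (the actual method optimizes discrete token probabilities and conditions the transformer internally).

```python
import numpy as np

def vare_training_step(frozen_next_scale, tuned_next_scale,
                       c_unsafe, c_neutral, num_scales=3):
    """Illustrative VARE-style step (a sketch, not the paper's exact procedure).

    frozen_next_scale / tuned_next_scale: callables (prompt, prefix_tokens) -> tokens.
    The frozen model supplies the reference prefixes r_ori* (unsafe prompt) and
    r_ori (neutral prompt); the tuned model predicts only the next scale,
    conditioned on those stable coarse-scale tokens.
    """
    losses = []
    r_unsafe, r_neutral = [], []          # reference prefixes built scale by scale
    for _ in range(num_scales):
        r_star = frozen_next_scale(c_unsafe, r_unsafe)    # r_ori* at this scale
        r = frozen_next_scale(c_neutral, r_neutral)       # r_ori at this scale
        # The tuned model sees the unsafe prompt but is pushed toward the
        # neutral reference tokens -- the erasure target for this scale.
        pred = tuned_next_scale(c_unsafe, r_unsafe)
        losses.append(np.mean((pred - r) ** 2))           # placeholder distance
        r_unsafe.append(r_star)
        r_neutral.append(r)
    return float(np.mean(losses))
```

Because the coarse-scale prefixes come from the frozen model, gradient updates only affect the conditional next-scale distribution, which is what keeps the rest of the token hierarchy intact.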
Building on VARE, the authors design S‑VARE, a surgical concept erasure method tailored to the discrete probability space of VAR models, particularly those employing the binary spherical quantization (BSQ) scheme introduced in the Infinity model. Instead of the mean‑squared error used in diffusion CE, S‑VARE employs a filtered cross‑entropy loss (L_FCE). The loss first applies a bit‑level filter: only bits whose classification confidence falls below a threshold γ (derived from binary classification accuracy) contribute to the loss. Then a token‑level filter excludes tokens whose proportion of incorrect bits is below α = 25 %, reflecting the self‑correction range used during Infinity training (0‑30 % erroneous bits are tolerated). The resulting mask F_i selectively penalizes only those token positions that are both unsafe and sufficiently uncertain, thereby avoiding over‑optimization of already correct tokens.
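The two-stage filtering described above can be made concrete with a small numpy sketch. This is an assumption-laden illustration, not the paper's implementation: the function name, the logistic parameterization of BSQ bit probabilities, and the default `gamma=0.9` are hypothetical; only the filtering logic (low-confidence bits pass the bit-level filter, tokens with fewer than α = 25 % wrong bits are skipped) follows the description.

```python
import numpy as np

def filtered_ce_loss(logits, targets, gamma=0.9, alpha=0.25):
    """Sketch of a filtered cross-entropy over BSQ-style bit predictions.

    logits:  (T, B) real-valued logits, T tokens of B bits each (assumed shape)
    targets: (T, B) target bits in {0, 1} from the erasure objective
    gamma:   bit-level confidence threshold (hypothetical value)
    alpha:   tokens with a wrong-bit fraction below alpha are excluded,
             mirroring the 0-30% self-correction range of Infinity training
    """
    probs = 1.0 / (1.0 + np.exp(-logits))                # per-bit probabilities
    conf = np.where(targets == 1, probs, 1.0 - probs)    # confidence in target bit
    bit_mask = conf < gamma                              # keep low-confidence bits only
    wrong = (probs > 0.5).astype(int) != targets         # bits currently predicted wrong
    token_mask = wrong.mean(axis=1) >= alpha             # keep sufficiently-wrong tokens
    mask = bit_mask & token_mask[:, None]                # combined filter F_i
    if mask.sum() == 0:
        return 0.0                                       # nothing left to optimize
    bce = -(targets * np.log(probs + 1e-9)
            + (1 - targets) * np.log(1.0 - probs + 1e-9))
    return float((bce * mask).sum() / mask.sum())
```

A token whose bits are already predicted correctly with high confidence contributes nothing, which is exactly the over-optimization the filter is meant to prevent.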
To counteract side‑effects commonly observed in naïve fine‑tuning—such as language drift (the model’s interpretation of prompts shifting) and reduced output diversity—the authors add a preservation loss (L_Pre). This loss aligns the outputs of the fine‑tuned model with those of the frozen original model for neutral concepts, ensuring that unrelated semantics remain stable and that the diversity of generated images is not compromised.
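One plausible form of such a preservation term is a divergence between the frozen and fine‑tuned models' bit distributions on neutral prompts. The sketch below uses a per‑bit Bernoulli KL divergence as an assumed stand‑in; the paper's exact formulation of L_Pre may differ.

```python
import numpy as np

def preservation_loss(tuned_logits, frozen_logits):
    """Sketch of a preservation loss: KL(frozen || tuned) per bit, averaged.

    Anchors the fine-tuned model's predictions on neutral concepts to the
    frozen original model, so unrelated semantics and diversity are preserved.
    (Hypothetical form -- one reasonable choice of alignment objective.)
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(frozen_logits, dtype=float)))
    q = 1.0 / (1.0 + np.exp(-np.asarray(tuned_logits, dtype=float)))
    eps = 1e-9  # numerical guard against log(0)
    kl = (p * np.log((p + eps) / (q + eps))
          + (1 - p) * np.log((1 - p + eps) / (1 - q + eps)))
    return float(kl.mean())
```

The loss is zero when the fine‑tuned model matches the original on neutral inputs and grows as their predictions drift apart, directly penalizing language drift.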
Extensive experiments are conducted on several benchmark unsafe concepts, including NSFW imagery, violent scenes, and copyrighted material. Quantitative results show that S‑VARE removes 97 % of targeted concepts while incurring less than a 2 % drop in CLIP‑based image quality scores, a substantial improvement over diffusion‑based CE baselines adapted to VAR, which suffer >10 % quality loss. Qualitative visualizations (e.g., token‑wise loss heatmaps across scales) demonstrate consistent optimization without the collapse observed in baseline methods. Comparisons with recent CE techniques such as FMN, ESD, and UCE further confirm that S‑VARE achieves superior concept removal while preserving style, composition, and overall fidelity.
In summary, the paper makes three key contributions: (1) it identifies and rigorously analyzes why existing diffusion‑centric CE methods fail for VAR models; (2) it proposes the VARE framework that leverages auxiliary visual tokens to stabilize fine‑tuning; and (3) it introduces S‑VARE, which combines a filtered cross‑entropy loss with a preservation loss to achieve surgical, low‑impact concept erasure. The work effectively closes the safety gap for autoregressive text‑to‑image generators, enabling safer deployment of high‑resolution, instruction‑following VAR systems without sacrificing generation quality or diversity. Future directions include scaling the approach to larger vocabularies, exploring automated detection of emerging unsafe concepts, and integrating the method into real‑time generation pipelines.