OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure
Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold’s invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
💡 Research Summary
OrthoEraser tackles the safety problem of text-to-image (T2I) models that can generate harmful content when prompted adversarially. Existing neuron-suppression approaches locate “sensitive” neurons and mute them entirely, but because semantic features in deep networks are highly entangled, such blunt interventions also damage benign attributes, a side effect the authors call collateral damage.
The paper reframes concept erasure as a geometric projection problem. First, a “sensitive score (SS)” is computed across all layers by comparing attention patterns between sensitive and non-sensitive prompt pairs; the layer with the highest SS is selected as the intervention point. In this layer, a sparse autoencoder (SAE) is trained to produce an over-complete, highly sparse representation of the activations. Each SAE neuron is scored with a weighted frequency score (WFS) and a differential \(\Delta\mathrm{WFS}\) between sensitive and non-sensitive prompts; the top-K neurons with the largest \(\Delta\mathrm{WFS}\) are designated as the sensitive neuron set \(N_{sens}\).
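The selection step at the end of this paragraph can be sketched as follows. This is a minimal illustration, not the authors' code: the summary does not define how WFS is computed, so the sketch takes precomputed per-neuron scores as inputs. The function name `top_k_sensitive`, the sign convention (sensitive minus non-sensitive), and the toy numbers are all assumptions.

```python
import numpy as np

def top_k_sensitive(wfs_sens, wfs_benign, k):
    """Pick the k SAE neurons with the largest differential score.

    wfs_sens, wfs_benign: 1-D arrays of per-neuron WFS values measured on
    sensitive and non-sensitive prompts (how WFS is computed is NOT shown
    here; it is treated as a given input).
    """
    # Assumed sign convention: delta = score on sensitive - score on benign.
    delta = np.asarray(wfs_sens) - np.asarray(wfs_benign)
    # argsort on the negated scores gives indices in descending order of delta.
    return np.argsort(-delta)[:k], delta

# Toy scores for four neurons (made-up values for illustration only).
wfs_sens = np.array([0.90, 0.10, 0.50, 0.70])
wfs_benign = np.array([0.20, 0.10, 0.45, 0.10])

n_sens, delta_wfs = top_k_sensitive(wfs_sens, wfs_benign, k=2)
print(n_sens)  # indices of the two neurons with the largest delta
```

Neurons that fire strongly on both prompt types (such as index 2 in the toy data) receive a small differential score and are left out, which is the point of ranking by the difference rather than by the raw score.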
Because the SAE basis is not orthogonal, removing \(N_{sens}\) alone would shift the activations of many benign neurons. To identify those vulnerable benign units, the authors perform a zero-ablation experiment: they subtract the contribution of \(N_{sens}\) from the dense latent, re-encode the result, and measure the absolute activation change \(\delta_j\) for every other neuron. The top-K neurons with the highest \(\delta_j\) form the “coupled neuron” set \(C\); these define a subspace that must be preserved to retain overall generation quality.
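A rough sketch of this zero-ablation probe is given below. It assumes a standard ReLU SAE with encoder weights `W_enc`, bias `b_enc`, and decoder weights `W_dec` (one decoder vector per column); the summary does not specify the SAE architecture, so these names, the shapes, and the random toy weights are hypothetical.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU SAE encoder (an assumed architecture): z = ReLU(W_enc h + b_enc)."""
    return np.maximum(W_enc @ h + b_enc, 0.0)

def find_coupled_neurons(h, W_enc, b_enc, W_dec, sens_idx, k):
    """Return indices of the k non-sensitive latents most affected by ablation."""
    z = sae_encode(h, W_enc, b_enc)
    # Subtract the decoded contribution of the sensitive latents from h.
    h_ablated = h - W_dec[:, sens_idx] @ z[sens_idx]
    # Re-encode and measure how much every latent moved (delta_j).
    delta = np.abs(sae_encode(h_ablated, W_enc, b_enc) - z)
    # Exclude the sensitive latents themselves from the ranking.
    delta[sens_idx] = -np.inf
    return np.argsort(-delta)[:k]

# Toy setup with random weights (illustration only, not a trained SAE).
rng = np.random.default_rng(0)
d_model, n_latent = 16, 64
W_enc = rng.normal(size=(n_latent, d_model))
b_enc = np.zeros(n_latent)
W_dec = rng.normal(size=(d_model, n_latent))
h = rng.normal(size=d_model)
sens_idx = np.array([3, 10, 42])

coupled_idx = find_coupled_neurons(h, W_enc, b_enc, W_dec, sens_idx, k=5)
print(coupled_idx)
```

The sketch probes a single activation vector; a practical version would presumably aggregate \(\delta_j\) over many prompts before ranking, though the summary does not say how.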
The core technical contribution is an analytical orthogonalization step. The decoder weight matrix of the coupled neurons, \(W_C\), is QR-decomposed to obtain an orthonormal basis \(Q\) of the protected subspace. The projection matrix onto that subspace is \(P = QQ^\top\). The raw sensitive direction \(d_{raw}\) (the weighted sum of the decoder vectors of \(N_{sens}\)) is projected onto the null space of \(P\), that is, the orthogonal complement of the protected subspace: \(d^* = (I - P)d_{raw}\). Finally, the latent representation is corrected as \(\tilde h = h - \lambda d^*\), where \(\lambda\) controls erasure strength. Because \(P d^* = (P - P^2)d_{raw} = 0\), the component of the latent inside the protected subspace is unchanged for any \(\lambda\), \(P\tilde h = P h\), while the sensitive component outside that subspace is removed.
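The projection can be checked numerically with a few lines of NumPy. The sketch below follows the formulas in this paragraph under two assumptions: the coupled decoder vectors are stored as the columns of `W_C`, and `W_C` has full column rank. The function names and the random toy data are illustrative and not taken from the paper.

```python
import numpy as np

def orthogonal_erasure_direction(W_C, d_raw):
    """Compute d* = (I - QQ^T) d_raw for the subspace spanned by W_C's columns.

    W_C   : (d_model, n_coupled) decoder vectors of the coupled neurons
    d_raw : (d_model,) raw sensitive direction
    """
    # Reduced QR: the columns of Q form an orthonormal basis of span(W_C).
    Q, _ = np.linalg.qr(W_C)
    # Apply (I - QQ^T) without materializing the d_model x d_model matrix P.
    return d_raw - Q @ (Q.T @ d_raw)

def erase(h, d_star, lam=1.0):
    """Corrected latent: h_tilde = h - lambda * d*."""
    return h - lam * d_star

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
d_model, n_coupled = 64, 8
W_C = rng.normal(size=(d_model, n_coupled))
d_raw = rng.normal(size=d_model)
h = rng.normal(size=d_model)

d_star = orthogonal_erasure_direction(W_C, d_raw)
h_tilde = erase(h, d_star, lam=2.0)

# d* is orthogonal to every coupled decoder vector ...
print(np.abs(W_C.T @ d_star).max())
# ... so the latent's overlap with each coupled vector is unchanged.
print(np.allclose(W_C.T @ h_tilde, W_C.T @ h))
```

Computing `Q @ (Q.T @ d_raw)` instead of building \(P\) explicitly keeps the cost linear in the model dimension, which matters when the activation width is large. In floating point the orthogonality holds only up to rounding error, which is consistent with the residual-leakage caveat the authors raise.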
Experiments on state‑of‑the‑art diffusion models (e.g., Stable Diffusion) show that OrthoEraser dramatically reduces the generation of adult, violent, or hateful images. Quantitatively, it outperforms baselines such as SNCE, ESD, and UCE in both precision/recall of concept removal and image quality metrics (FID, IS, CLIPScore). Human evaluations confirm that harmful content is largely eliminated while realism, color fidelity, and composition are preserved. Ablation studies demonstrate that omitting the coupled‑neuron protection leads to noticeable quality degradation, validating the necessity of the orthogonal projection.
The authors acknowledge limitations: training SAEs is computationally intensive; defining sensitive prompts is domain‑specific; and numerical imperfections in the null‑space projection could cause minor residual leakage. Future work is suggested on lightweight sparse coding, simultaneous multi‑concept erasure, real‑time orthogonalization, and dynamic subspace updates.
In summary, OrthoEraser introduces a principled, geometry‑driven framework that decouples harmful semantic directions from the benign manifold of T2I models, achieving high‑precision erasure with minimal collateral damage and setting a new benchmark for safe generative AI.