$\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality

Notice: This research summary and analysis were generated automatically using AI technology. For authoritative details, please refer to the original arXiv source.

While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at https://github.com/TiezMind/AGT-unlearning.


💡 Research Summary

Large language models (LLMs) inevitably memorize portions of their training data, which can include private, copyrighted, or harmful information. Existing machine‑unlearning approaches face a fundamental trade‑off: aggressive parameter updates erase the target data but cause catastrophic forgetting of general knowledge, while conservative updates only mask the data, leaving it recoverable through adversarial probing. To resolve this dilemma, the authors introduce AGT⁽ᴬᴼ⁾ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework that simultaneously safeguards model utility and achieves robust erasure.
The framework consists of two complementary components. First, Adaptive Orthogonality (AO) measures the cosine similarity between the gradient of the forget loss (g_f) and the gradient of the retain loss (g_r). When these gradients are antagonistic (g_f·g_r < 0), a regularization term R_AO = γ·(1 – cos(g_f,g_r))² is added, encouraging the updates to become orthogonal. This dynamic penalty reduces gradient conflict, ensuring that parameters responsible for general knowledge are minimally perturbed while those tied to the forget set are targeted.
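The AO penalty described above can be sketched as follows. This is a minimal illustration using flattened gradient vectors, not the authors' implementation; the function name `ao_penalty` and the default `gamma` value are assumptions, and the formula follows the summary's R_AO = γ·(1 − cos(g_f, g_r))², applied only when the two gradients conflict.

```python
import numpy as np

def ao_penalty(g_f, g_r, gamma=0.1):
    """Adaptive Orthogonality penalty (sketch).

    Activates only when the forget gradient g_f and retain gradient g_r
    are antagonistic (negative inner product), and then penalizes their
    deviation from orthogonality with gamma * (1 - cos(g_f, g_r))**2.
    """
    dot = float(g_f @ g_r)
    if dot >= 0:
        # Gradients aligned or orthogonal: no conflict, no regularization.
        return 0.0
    cos = dot / (np.linalg.norm(g_f) * np.linalg.norm(g_r) + 1e-12)
    return gamma * (1.0 - cos) ** 2

# Fully opposed gradients (cos = -1) incur the maximal penalty gamma * 4.
g_f = np.array([1.0, 0.0])
g_r = np.array([-1.0, 0.0])
penalty = ao_penalty(g_f, g_r, gamma=0.1)
```

In a training loop this penalty would be added to the combined forget-plus-retain objective, so that conflicting updates are steered toward orthogonal directions rather than cancelling each other.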
Second, Adversarial Gating Training (AGT) formulates unlearning as a min‑max game in the latent space of a chosen transformer layer. An inner maximization step searches for a worst‑case perturbation δ (bounded by ‖δ‖ₚ ≤ ε) that would revive forgotten information, using projected gradient descent (PGD). The outer minimization updates the model parameters to minimize the combined loss (forget + retain + AO) under this adversarial perturbation. To prevent instability during early training, a Gradient‑Norm‑Based Gating curriculum is employed: the adversarial inner loop is disabled for an initial warm‑up period, and later activated only when the L₂ norm of the total loss gradient falls below a threshold τ_grad. This ensures that adversarial pressure is applied only when the optimization trajectory is sufficiently stable.
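The inner maximization and the gating curriculum can be sketched as below. This is a toy numpy illustration under stated assumptions, not the paper's code: `grad_fn` is a hypothetical interface returning the adversary's recovery-objective gradient with respect to the latent perturbation δ, the L2 norm stands in for the general ‖·‖ₚ bound, and the step sizes and thresholds are placeholder values.

```python
import numpy as np

def project_l2(delta, eps):
    """Project delta onto the L2 ball of radius eps (||delta||_2 <= eps)."""
    n = np.linalg.norm(delta)
    return delta if n <= eps else delta * (eps / n)

def pgd_inner_max(grad_fn, dim, eps=0.5, step=0.1, n_steps=5, rng=None):
    """Inner maximization (sketch): PGD ascent on a latent perturbation delta
    that tries to revive forgotten information, kept within the eps-ball."""
    rng = rng or np.random.default_rng(0)
    delta = project_l2(rng.normal(size=dim) * 1e-3, eps)  # small random start
    for _ in range(n_steps):
        delta = project_l2(delta + step * grad_fn(delta))  if False else \
                project_l2(delta + step * grad_fn(delta), eps)
    return delta

def gate_active(step, grad_norm, warmup=100, tau_grad=1.0):
    """Gradient-norm-based gating curriculum (sketch): the adversarial inner
    loop stays off during warm-up and turns on only once the total-loss
    gradient norm has fallen below tau_grad."""
    return step >= warmup and grad_norm < tau_grad
```

At each outer step, the trainer would call `gate_active`; if it returns True, `pgd_inner_max` produces the worst-case δ and the model parameters are then updated to minimize the combined forget, retain, and AO loss under that perturbation.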
Experiments are conducted on four open‑source LLMs (LLaMA‑2‑7B‑chat, Gemma‑2B‑it, Zephyr‑7B‑beta, ICLM‑7B) across three benchmarks: TOFU (fictional biographies), MUSE (news and books copyright removal), and WMDP (hazardous cybersecurity capabilities). Evaluation metrics cover unlearning efficacy (Forget Quality, Knowledge Unlearning Ratio KUR), utility (Model Utility, fluency), and privacy (Privacy Leakage Ratio PLR, membership inference attacks). AGT⁽ᴬᴼ⁾ consistently achieves KUR ≈ 0.01—an order of magnitude lower than baselines—while maintaining high utility scores (e.g., Model Utility ≈ 0.90, Fluency ≈ 0.90). Compared to gradient‑ascent (GA), negative‑preference optimization (NPO), and other recent methods, AGT⁽ᴬᴼ⁾ shows superior resistance to jailbreak attacks, indicating that the erased knowledge is not merely masked but truly inaccessible.
The paper’s contributions are threefold: (1) a novel AO regularizer that dynamically resolves gradient conflicts, (2) an AGT mechanism that enforces robust unlearning via latent‑space adversarial training, and (3) a curriculum‑based gating strategy that stabilizes training and prevents premature catastrophic forgetting. The combined AGT⁽ᴬᴼ⁾ framework thus bridges the gap between aggressive and conservative unlearning, offering a practical solution for privacy‑compliant LLM deployment without full retraining. Limitations include evaluation only on 7‑billion‑parameter models and sensitivity to hyper‑parameters (γ, ε, τ_grad), suggesting future work on scaling to larger models and automated hyper‑parameter tuning. Overall, AGT⁽ᴬᴼ⁾ sets a new benchmark for safe, efficient, and robust LLM unlearning.

