CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influence of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be impractical to obtain or prohibitively expensive in data and computation. Negative preference alignment has been explored to address the limitations of GA, but it remains constrained by its choice of reference model and underperforms in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model’s token-level confidence, ensuring fine-grained control over forgetting. Extensive evaluations on the MUSE and WMDP benchmarks demonstrate that our method enables effective unlearning without requiring retention data or contrastive unlearning response pairs, achieving stronger forgetting-preservation tradeoffs than state-of-the-art methods.


💡 Research Summary

The paper addresses the pressing problem of selectively removing undesirable knowledge—such as copyrighted text, hazardous instructions, or personal data—from large language models (LLMs) without sacrificing their general capabilities. Traditional unlearning approaches rely on Gradient Ascent (GA), which simply maximizes the loss on “forgetting” data. While effective at reducing the targeted knowledge, GA indiscriminately penalizes all tokens, leading to catastrophic forgetting of useful information. Moreover, many GA‑based methods require a retention dataset to preserve general knowledge, which is often unavailable in practice.
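GA's "maximize the loss on the forget set" amounts to flipping the sign of the standard cross-entropy objective. A minimal sketch of that idea (the function names are illustrative, and scalar token probabilities stand in for model logits):

```python
import math

def nll(token_probs):
    """Standard negative log-likelihood of a response, summed over tokens."""
    return -sum(math.log(p) for p in token_probs)

def ga_loss(token_probs):
    """Gradient Ascent unlearning: maximize the NLL on forget data,
    i.e. minimize its negation. Note that every token is penalized with
    the same sign, regardless of how confident the model is in it --
    the indiscriminate behavior the paper criticizes."""
    return -nll(token_probs)

# The model is confident in the first token and unsure about the rest,
# yet GA pushes down on all of them alike.
probs = [0.95, 0.40, 0.10]
loss = ga_loss(probs)
```

Because this loss is unbounded below, training can run away and damage unrelated capabilities, which is why GA-based methods typically lean on a retention dataset.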

Negative Preference Optimization (NPO) improves on GA by framing unlearning as a preference‑ranking problem: the model should be less likely to produce the undesirable response than a reference model. However, NPO typically uses a static reference model (the pre‑unlearning checkpoint) and still depends on retention data for stability. This static reference provides diminishing guidance as the target model learns, especially when the reference already assigns high probability to the unwanted token. Consequently, NPO’s effectiveness plateaus, and it suffers from length bias because the loss aggregates over entire sequences, giving longer examples disproportionate influence.
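The plateauing behavior follows directly from NPO's sequence-level loss, -(2/β)·log σ(-β·log(π_θ/π_ref)). A minimal sketch with scalar sequence probabilities (`npo_loss` and its arguments are illustrative names, not the paper's code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def npo_loss(p_theta, p_ref, beta=1.0):
    """Sequence-level NPO loss against a static reference model:
    -(2/beta) * log sigmoid(-beta * log(p_theta / p_ref)).
    p_theta, p_ref: probabilities of the undesirable response under the
    current model and the frozen pre-unlearning checkpoint."""
    log_ratio = math.log(p_theta) - math.log(p_ref)
    return -(2.0 / beta) * math.log(sigmoid(-beta * log_ratio))

# Once the current model already falls below the static reference,
# the loss flattens out -- the diminishing guidance noted above.
before = npo_loss(p_theta=0.9, p_ref=0.9)  # model still matches reference
after = npo_loss(p_theta=0.1, p_ref=0.9)   # model already suppressed
```

Since the loss only depends on the ratio to a frozen π_ref, its gradient shrinks as π_θ drops, and aggregating over whole sequences lets longer responses dominate the update.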

The authors propose CATNIP (Calibrated and Tokenized Negative Preference Alignment), a novel unlearning framework that tackles both the reference‑model and length‑bias issues. Its two core innovations are:

  1. Adaptive Reverse Reference Policy – Instead of a fixed π_ref, CATNIP defines the reference policy π_β as the complement of the current model’s distribution, π_β(y|x) = 1 − π_θ(y|x). This “reverse” policy dynamically adapts as the model changes. When the model is highly confident about an undesirable token (π_θ≈1), the margin 1/(1 − π_θ) becomes large, amplifying the penalty for that token. Conversely, low‑confidence tokens receive milder updates. This calibration directly ties the magnitude of gradient updates to the model’s token‑level confidence, ensuring that the most entrenched knowledge is targeted most aggressively.
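The calibration effect of the reverse reference can be seen in the per-token log-ratio log(π_θ / (1 − π_θ)), which replaces NPO's log(π_θ / π_ref). A short sketch (the function name is illustrative):

```python
import math

def reverse_reference_margin(p_theta):
    """Per-token log-ratio against the reverse reference
    pi_beta = 1 - pi_theta, i.e. log(p / (1 - p)).
    As p approaches 1 the ratio behaves like 1 / (1 - p), so
    confidently memorized undesirable tokens receive a large margin
    (and a large penalty), while uncertain tokens get a mild one."""
    return math.log(p_theta) - math.log(1.0 - p_theta)

confident = reverse_reference_margin(0.99)  # entrenched knowledge
uncertain = reverse_reference_margin(0.10)  # already unlikely token
```

Because the reference is defined from the current π_θ, it never goes stale: as unlearning lowers the model's confidence, the margin shrinks automatically and the updates ease off.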

  2. Token‑Level Unlearning Objective – CATNIP treats each token y_i in the response as an independent training sample. The loss is the average over tokens:

    L_CATNIP(θ) = −(2/β) 𝔼_{(x,y)∈D} [ (1/|y|) Σ_{i=1}^{|y|} log σ( −β log( π_θ(y_i | x, y_{<i}) / (1 − π_θ(y_i | x, y_{<i})) ) ) ]
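The token-level objective described above, combining the reverse reference with per-token averaging, can be sketched in plain Python (an illustrative reconstruction from the description here, not the authors' implementation; β and the helper names are assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def catnip_loss(token_probs, beta=1.0):
    """Sketch of a CATNIP-style loss: each token y_i is its own sample,
    the reference is the reverse policy pi_beta = 1 - pi_theta, and the
    per-token NPO terms are averaged over the response so that length
    does not bias the update."""
    per_token = []
    for p in token_probs:
        log_ratio = math.log(p) - math.log(1.0 - p)  # log(p / (1 - p))
        per_token.append(-(2.0 / beta) * math.log(sigmoid(-beta * log_ratio)))
    return sum(per_token) / len(per_token)

# A single high-confidence token contributes a large per-token term,
# while averaging keeps long responses from outweighing short ones.
short = catnip_loss([0.99])
long_ = catnip_loss([0.99, 0.5, 0.5, 0.5])
```

Averaging rather than summing over tokens is what removes the length bias attributed to NPO: a four-token response and a one-token response enter the expectation on equal footing.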

