Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.


💡 Research Summary

The paper tackles a pressing problem in the lifecycle of large language models (LLMs): safety degradation that can occur when a model is fine‑tuned on downstream data, whether the data are benign or malicious. Existing defenses typically rely on a static regularization term—most commonly a fixed KL‑penalty that forces the fine‑tuned model to stay close to a pre‑aligned reference policy. Such static constraints suffer from an inherent trade‑off: a weak penalty fails to stop safety drift under adversarial fine‑tuning, while a strong penalty hampers the model’s ability to adapt to new tasks, degrading utility.

To overcome this limitation, the authors propose an adaptive regularization framework that modulates the strength of the KL regularizer in real time based on a training‑time safety risk signal. The framework consists of two components: (1) a Safety Critic that estimates per‑example or per‑batch harmfulness, and (2) an Adaptive Alignment Objective that uses the critic’s output to dynamically balance the standard supervised negative log‑likelihood (NLL) loss with the KL regularization term.

Two distinct critics are explored:

  1. Judge‑Based Safety Critic – an external LLM (e.g., gpt‑oss‑20b) evaluates the model’s generated response (and the reference model’s response) against 11 safety categories (violence, hate, fraud, etc.) and returns a high‑level harm score. This score provides a high‑recall safety signal but incurs considerable inference cost because the judge must be queried for every training example.

  2. Activation‑Based Risk Predictor – a lightweight classifier operates on the internal hidden state of the model before any token is generated. The authors first demonstrate that harmful intent is linearly predictable from these pre‑generation activations: logistic probes trained on hidden vectors achieve AUROC > 0.9 across multiple model families and both in‑distribution and out‑of‑distribution tests. Layer‑wise ablations show that early and late layers both contain strong signals, prompting the authors to pool evidence across layers (mean, max, or weighted averaging) to obtain a robust scalar risk score. This approach adds zero inference overhead during fine‑tuning because the probe can be run on the same forward pass used for the main task.
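The activation-based predictor described above can be sketched in a few lines. This is a minimal illustration, not the paper's trained probes: the probe weights, the set of layers, and the toy inputs are all assumptions, but it shows the structure of per-layer linear probes followed by score pooling.

```python
import numpy as np

def layer_risk_scores(acts, probes):
    """Apply one linear probe per layer to a pre-generation hidden
    vector; return per-layer harm probabilities via a sigmoid."""
    scores = {}
    for layer, h in acts.items():
        w, b = probes[layer]
        scores[layer] = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    return scores

def pooled_risk(scores, mode="mean"):
    """Pool per-layer scores into a single scalar risk estimate
    (mean or max pooling over layers)."""
    vals = np.array(list(scores.values()))
    return float(vals.mean()) if mode == "mean" else float(vals.max())

# Toy usage: 3 layers, 4-dim hidden states, randomly initialized probes
# standing in for probes fit on labeled harmful/benign prompts.
rng = np.random.default_rng(0)
acts = {layer: rng.normal(size=4) for layer in (0, 1, 2)}
probes = {layer: (rng.normal(size=4), 0.0) for layer in (0, 1, 2)}
scores = layer_risk_scores(acts, probes)
risk = pooled_risk(scores, mode="max")
```

Because the probe reads hidden states already computed in the model's forward pass, evaluating it adds essentially no overhead to each training step.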

The Adaptive Alignment Objective is defined as

L_total(t) = α_t · L_NLL + β_t · L_KL

where L_NLL is the standard supervised loss, L_KL = E_x[ KL( π_θ(·|x) ‖ π_ref(·|x) ) ] measures the divergence of the fine‑tuned policy π_θ from the safe reference policy π_ref, and the coefficients α_t and β_t are modulated at each step by the critic's risk score: high estimated risk shifts weight toward the KL term, pulling the update back toward π_ref, while low‑risk batches proceed with essentially standard training.
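The risk-to-weight mapping can be sketched as follows. The smooth sigmoid gate, the threshold, and the bounds on β_t are illustrative assumptions for this sketch; the paper's exact schedule may differ.

```python
import math

def adaptive_weights(risk, beta_min=0.1, beta_max=10.0,
                     threshold=0.5, sharpness=10.0):
    """Map a scalar risk score in [0, 1] to loss weights.
    Low risk -> beta_t near beta_min (mostly task loss);
    high risk -> beta_t near beta_max (strong pull toward
    the safe reference policy)."""
    gate = 1.0 / (1.0 + math.exp(-sharpness * (risk - threshold)))
    beta_t = beta_min + (beta_max - beta_min) * gate
    alpha_t = 1.0  # keep the supervised term fixed; only the regularizer adapts
    return alpha_t, beta_t

def total_loss(nll, kl, risk):
    """L_total(t) = alpha_t * L_NLL + beta_t * L_KL."""
    alpha_t, beta_t = adaptive_weights(risk)
    return alpha_t * nll + beta_t * kl
```

A smooth gate rather than a hard threshold keeps the objective differentiable in the risk score and avoids abrupt swings in regularization strength between consecutive batches.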

