Jailbreaking LLMs via Calibration

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model’s aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower “Jailbreak Tax” compared with existing methods, especially on the safety-hardened gpt-oss-120b.


💡 Research Summary

The paper tackles the problem of “jailbreaking” large language models (LLMs) that have been safety‑aligned, i.e., trained to refuse or suppress harmful content. The authors observe that safety alignment systematically distorts the model’s next‑token distribution relative to a hypothetical pre‑alignment distribution that reflects the model’s full capabilities. They formalize this distortion as a statistical mis‑calibration error and propose to view the recent “Weak‑to‑Strong” jailbreak paradigm as a forecast‑aggregation problem.

Three models are involved: (1) a high‑capacity, strongly aligned target model (π_t); (2) a smaller, unaligned helper model (π_h) that approximates the pre‑alignment distribution; and (3) a predictor model (π_{t|h}) that predicts the target’s output conditioned on the helper’s output. Under the key assumptions that the helper is calibrated to the pre‑alignment outcome and that the predictor provides the conditional expectation of the target, the authors derive an optimal aggregation rule in the dual (gradient) space induced by any strictly proper loss ℓ.

Every proper loss ℓ can be expressed via a convex generator G; the gradient ∇G maps probabilities to a dual space, and the associated Bregman divergence D_G measures excess risk. The optimal rule—called the Gradient Shift—updates the target’s dual representation by the difference between the helper’s and predictor’s duals:

 ∇G(q*) = ∇G(π_t) − λ·(∇G(π_{t|h}) − ∇G(π_h))
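As a concrete illustration, consider the cross-entropy (log) loss: its convex generator is the negative entropy G(p) = Σᵢ pᵢ log pᵢ, so ∇G maps a distribution to its log-probabilities up to an additive constant, and the Gradient Shift reduces to logit arithmetic over the three models' next-token logits. The minimal NumPy sketch below is our own illustration of this special case, not the paper's implementation; all function and variable names are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax: logits -> log-probabilities."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def gradient_shift_log_loss(logits_t, logits_h, logits_th, lam=1.0):
    """Gradient Shift under log loss, where ∇G(p) = log p + const.

    Computes ∇G(q*) = ∇G(π_t) − λ·(∇G(π_{t|h}) − ∇G(π_h)) in the dual
    space and maps back through softmax = (∇G)⁻¹. The additive
    constants cancel, so the rule is plain logit arithmetic over the
    target, helper, and predictor models' next-token logits.
    """
    dual = log_softmax(logits_t) - lam * (log_softmax(logits_th) - log_softmax(logits_h))
    # Map the shifted dual representation back to a probability vector.
    p = np.exp(dual - dual.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)
```

When λ = 0, or when the helper and predictor agree token-for-token, the rule returns the target's own softmax distribution; λ > 0 pushes q* toward the helper's pre-alignment behavior along the component the predictor fails to explain.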

