Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks such as GCG that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous ℓ0-norm guarantees using the hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
💡 Research Summary
The paper tackles the pressing problem of jailbreak attacks on large language models (LLMs), which can force a model to produce harmful content despite safety filters. Existing defenses are largely empirical—perplexity‑based filters, “erase‑and‑check” pipelines, or character‑level randomization—and are quickly circumvented by adaptive gradient‑based attacks such as Greedy Coordinate Gradient (GCG) and AutoDAN. To move beyond this cat‑and‑mouse dynamic, the authors propose a provable defense framework that shifts safety guarantees from a single forward pass to the statistical stability of an ensemble of model evaluations.
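The ensemble idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `classify` and `perturb` callables are hypothetical stand-ins for a safety classifier and the paper's randomization procedure, and the smoothed decision is simply the majority label over many randomized evaluations.

```python
from collections import Counter

def smoothed_safety_decision(classify, perturb, prompt, n_samples=100):
    """Smoothed decision: classify many randomly perturbed copies of the
    prompt and return the most frequent label across the ensemble."""
    votes = Counter(classify(perturb(prompt)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

The safety guarantee then rests on the vote margin of this ensemble rather than on any single forward pass, which is what makes a statistical certificate possible.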
The core of the framework is Certified Semantic Smoothing (CSS), a novel adaptation of randomized smoothing for discrete token spaces. Inputs are split into two disjoint sets: immutable structural tokens (system prompts, chat templates, delimiters) and mutable semantic payload tokens (the user query). CSS performs Stratified Randomized Ablation: structural tokens are always retained, while a random subset of k semantic tokens is kept and the rest are masked via a randomized attention mask. This mask prevents the model from attending to omitted tokens without deleting them, thereby preserving positional embeddings and avoiding the out-of-distribution symbols that explicit replacement tokens would introduce.
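A minimal sketch of the two mechanics described above, under stated assumptions: the function names are hypothetical, the mask is a boolean list rather than a real attention mask, and the certificate term is the standard randomized-ablation quantity, the probability that a single draw of k kept tokens avoids all ρ adversarially perturbed payload tokens, which follows the hypergeometric distribution.

```python
import math
import random

def stratified_ablation_mask(tokens, structural_idx, k, rng=random):
    """One stratified draw: always keep structural tokens, keep a uniform
    random subset of k payload tokens, and mask out the remainder."""
    payload_idx = [i for i in range(len(tokens)) if i not in structural_idx]
    kept = set(rng.sample(payload_idx, k))
    return [i in structural_idx or i in kept for i in range(len(tokens))]

def ablation_certificate(n, k, rho):
    """P[a draw keeps none of rho perturbed tokens] = C(n-rho, k) / C(n, k),
    where n is the number of payload tokens and k the number kept."""
    if k > n - rho:
        return 0.0
    return math.comb(n - rho, k) / math.comb(n, k)
```

For example, with n = 4 payload tokens, k = 2 kept, and ρ = 1 perturbed token, half of all draws avoid the perturbation, so the perturbed token can flip at most that fraction of ensemble votes.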