Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²), a jailbreak framework that leverages Pearl's Front-Door Criterion to sever these confounding associations and enable robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
💡 Research Summary
The paper introduces a novel jailbreak framework for large language models (LLMs) that treats safety alignment mechanisms as hidden confounders in a causal graph. By modeling the harmful query (X), the model’s internal representation (A), the final response (Y), and an unobserved safety component (U), the authors show that conventional attacks fail because they only exploit surface‑level correlations while the safety guardrail U simultaneously influences both the representation and the output. To break this confounding effect, they invoke Pearl’s Front‑Door Criterion, introducing an observable mediator S that captures the core task semantics of the query and is assumed independent of U. The causal effect of the query on the response can then be expressed as a front‑door adjustment:
$$P(Y \mid do(A)) = \sum_{s} P(S=s \mid A) \sum_{a'} P(Y \mid A=a', S=s)\,P(A=a').$$
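To see the adjustment in action, the following is a minimal numeric sketch on a fully made-up binary toy model with the summary's graph structure (U → A, A → S, S → Y, U → Y). All probability tables are invented for illustration; the point is that the front-door estimate, computed only from observational quantities that never condition on U, matches the true interventional distribution obtained by manipulating the structural model directly.

```python
# Toy check of the front-door adjustment. Variables A (treatment), S
# (mediator), Y (outcome), U (hidden confounder) are all binary; the
# probability tables below are arbitrary illustrative numbers.
pU = [0.7, 0.3]                               # P(U)
pA_U = [[0.8, 0.2], [0.3, 0.7]]               # pA_U[u][a] = P(A=a | U=u)
pS_A = [[0.9, 0.1], [0.2, 0.8]]               # pS_A[a][s] = P(S=s | A=a)
pY1_SU = {(0, 0): 0.1, (0, 1): 0.5,           # P(Y=1 | S=s, U=u)
          (1, 0): 0.6, (1, 1): 0.9}

def joint(u, a, s, y):
    """Full joint P(U=u, A=a, S=s, Y=y) from the structural model."""
    py1 = pY1_SU[(s, u)]
    return pU[u] * pA_U[u][a] * pS_A[a][s] * (py1 if y else 1 - py1)

# Observational quantities, as an attacker could estimate them without U.
def P_A(a):
    return sum(joint(u, a, s, y) for u in (0, 1) for s in (0, 1) for y in (0, 1))

def P_S_given_A(s, a):
    return sum(joint(u, a, s, y) for u in (0, 1) for y in (0, 1)) / P_A(a)

def P_Y1_given_AS(a, s):
    num = sum(joint(u, a, s, 1) for u in (0, 1))
    den = sum(joint(u, a, s, y) for u in (0, 1) for y in (0, 1))
    return num / den

def front_door(a):
    # P(Y=1|do(A=a)) = sum_s P(s|a) * sum_{a'} P(Y=1|a',s) P(a')
    return sum(P_S_given_A(s, a) *
               sum(P_Y1_given_AS(a2, s) * P_A(a2) for a2 in (0, 1))
               for s in (0, 1))

def ground_truth(a):
    # Direct intervention on the structural model: set A=a, leave U alone.
    return sum(pU[u] * pS_A[a][s] * pY1_SU[(s, u)]
               for u in (0, 1) for s in (0, 1))

for a in (0, 1):
    assert abs(front_door(a) - ground_truth(a)) < 1e-12
```

The agreement holds because S satisfies the front-door conditions here: it intercepts the only directed path from A to Y, and the confounder U touches S only through A.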
Operationalizing this formula requires two technical breakthroughs. First, the authors employ Sparse Autoencoders (SAEs) to decompose the high‑dimensional hidden activations of an LLM into a set of sparse latent features. By constructing a contrastive dataset consisting of (i) original harmful prompts that trigger refusal and (ii) jailbreak variants that preserve the same task intent but evade refusal, they statistically separate “defense‑related” directions (denoted d) from “task‑intent” directions (denoted S). Second, they apply weight orthogonalization: the model’s output weight matrix (W_{out}) is projected onto the orthogonal complement of the defense direction d, effectively removing the causal pathway from U to Y. This structural intervention collapses the expensive marginalization in the front‑door formula into a deterministic forward pass with O(1) inference cost, making the attack training‑free and computationally cheap.
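The two steps above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: random matrices stand in for SAE feature activations, the "defense direction" d is estimated as a simple contrastive mean difference (one plausible reading of the statistical separation described above), and the orthogonalization is applied to a generic square weight matrix.

```python
import numpy as np

# Illustrative shapes and data; real SAE activations would replace these.
rng = np.random.default_rng(0)
d_model = 64

# (1) Contrastive estimate of a defense direction d: mean activation on
# refusal-triggering prompts minus mean activation on intent-preserving
# jailbreak variants (both matrices here are synthetic stand-ins).
acts_refused = rng.normal(size=(128, d_model)) + 1.0
acts_jailbroken = rng.normal(size=(128, d_model))
d = acts_refused.mean(axis=0) - acts_jailbroken.mean(axis=0)
d /= np.linalg.norm(d)                      # unit-norm defense direction

# (2) Weight orthogonalization: project W_out onto the orthogonal
# complement of d, i.e. W' = (I - d d^T) W_out, so the edited layer can
# no longer write any component along the defense direction.
W_out = rng.normal(size=(d_model, d_model))
W_edited = (np.eye(d_model) - np.outer(d, d)) @ W_out

# The edit is one-time and structural: every subsequent forward pass
# through W_edited yields outputs orthogonal to d, with no extra
# inference-time cost relative to the unedited matrix.
h = rng.normal(size=d_model)
assert abs(d @ (W_edited @ h)) < 1e-9
```

The final assertion makes the "deterministic intervention" concrete: because d-components are removed from the weights themselves rather than patched per query, the front-door marginalization is replaced by an ordinary forward pass.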
The resulting method, called Causal Front‑Door Adjustment Attack (CFA²), is evaluated on several state‑of‑the‑art LLMs (GPT‑3.5, LLaMA‑2, Claude, etc.) across multiple safety benchmark datasets. CFA² achieves an average attack success rate (ASR) of 83.68 %, surpassing prior optimization‑based attacks such as GCG and PAIR, while also producing prompts that are more natural and stealthy according to human evaluations. Ablation studies confirm that both the SAE‑based mediator identification and the weight orthogonalization are essential for the observed performance gains.
The paper’s contributions are threefold: (1) a causal formulation of jailbreak that models safety mechanisms as unobserved confounders and leverages the front‑door criterion to theoretically justify confounder removal; (2) a training‑free implementation that combines sparse representation learning with a simple linear projection to physically strip the defense subspace; (3) empirical evidence of superior robustness, efficiency, and stealth compared to existing methods. Limitations include the reliance on a clear separation between task‑intent and defense features, which may be blurred in models with more entangled safety mechanisms, and the need for contrastive data to discover the defense direction. Nonetheless, the work opens a new line of research that applies causal inference tools to both attack and defense of LLMs, suggesting that future safety designs should consider causal independence from task semantics to resist front‑door style attacks.