Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify→Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
💡 Research Summary
The paper introduces PACT (Prompt‑configured Action via Chain‑of‑Thought), a novel framework for dynamically controlling the safety behavior of large language models (LLMs) while preserving strong safety guarantees. The authors argue that existing alignment approaches rely on a single static safety policy, which inevitably forces a trade‑off between safety and helpfulness: models either over‑refuse benign requests or under‑constrain harmful ones. To resolve this, PACT separates safety constraints into two hierarchical layers.
The Global Policy (P_G) is embedded into the model parameters during training and defines an immutable taxonomy of critical risks (e.g., child safety, violent extremism). When a query triggers any global risk, the model immediately executes a pre‑determined action (GUIDE or REJECT) and bypasses any downstream checks. This early‑exit mechanism guarantees that no user‑provided policy can override the non‑negotiable safety baseline, protecting against jailbreaks and policy‑injection attacks.
The User Policy (P_U) is a runtime‑configurable set of specifications supplied via prompts. It can introduce domain‑specific risk categories and map each risk label to one of three possible actions: COMPLY (fulfil the request), GUIDE (provide a constructive, safety‑aware response), or REJECT (simple refusal). Because P_U is evaluated only after the global check, it can safely increase model utility in specialized contexts (e.g., medical advice, software development) without compromising the global safety floor.
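The two-layer resolution described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the risk labels, the `Action` names, and the policy dictionaries are hypothetical stand-ins for the actual P_G and P_U taxonomies.

```python
from enum import Enum

class Action(Enum):
    COMPLY = "comply"   # fulfil the request
    GUIDE = "guide"     # constructive, safety-aware response
    REJECT = "reject"   # simple refusal

# Illustrative global taxonomy (P_G): risk label -> mandated action.
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}

def resolve_action(global_label, user_policy, user_label):
    """Hierarchical resolution: the global check short-circuits, so a
    user-supplied policy can never override the global safety floor."""
    if global_label in GLOBAL_POLICY:
        # Early exit: downstream (user-policy) checks are bypassed.
        return GLOBAL_POLICY[global_label]
    if user_label in user_policy:
        return user_policy[user_label]
    return Action.COMPLY  # no risk found under either taxonomy

# A deployment-specific user policy (P_U), e.g. a medical assistant.
user_policy = {"dosage_advice": Action.GUIDE, "off_label_use": Action.REJECT}

resolve_action("child_safety", user_policy, "dosage_advice")  # global wins: REJECT
resolve_action(None, user_policy, "dosage_advice")            # user policy: GUIDE
```

The early return is what makes the global layer non-negotiable: a policy-injection attack can only alter `user_policy`, which is never consulted once a global label fires.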
A central technical contribution is the risk‑aware Chain‑of‑Thought Path (CoTPath). The model is trained to output an explicit reasoning trace that first classifies the input according to the global taxonomy, then (if needed) classifies it according to the user taxonomy, and finally selects the appropriate action. For each possible label the authors generate three pre‑written response templates (one per action) during a self‑distillation phase. This ensures 100% consistency between the declared label and the actual response, and provides full transparency: operators can read the model's internal decision process at inference time.
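Because the trace is structured, an operator can parse it and audit label-action consistency mechanically. The tag names below are hypothetical; the paper's actual trace markup may differ.

```python
import re

# Illustrative CoTPath trace: global classification, user classification,
# chosen action, and the final response, in that order.
trace = """<classify_global>none</classify_global>
<classify_user>dosage_advice</classify_user>
<act>GUIDE</act>
<response>General storage tips follow, but confirm with a pharmacist.</response>"""

def parse_cot_path(text):
    """Extract the Classify->Act fields so the decision can be audited."""
    fields = {}
    for tag in ("classify_global", "classify_user", "act", "response"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

decision = parse_cot_path(trace)

# Consistency audit: the declared label must map to the emitted action
# under the active (here: hypothetical) user policy.
label_to_action = {"dosage_advice": "GUIDE"}
assert decision["act"] == label_to_action[decision["classify_user"]]
```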
Data creation proceeds without external annotators. The base model (Qwen‑3‑8B) is prompted to produce (i) a risk label, (ii) a reasoning snippet, and (iii) three action‑specific replies for every query. Over 570 k such triples are collected from a mixture of general QA, curated safety prompts, and a “red” model that generates adversarial risk queries. The resulting dataset, D_distill, is then transformed into CoTPath examples and used for supervised fine‑tuning (SFT). The authors note that reinforcement learning from human feedback (RLHF) could further penalize any deviation from the prescribed label‑action mapping, but this is left for future work.
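The distill-then-transform step can be sketched as follows. The record schema and tag format are assumptions for illustration; the released D_distill may use different field names.

```python
import json

# One self-distilled record: the base model produced a risk label,
# a reasoning snippet, and three action-specific replies for the query.
record = {
    "query": "How should I store insulin while travelling?",
    "label": "dosage_advice",
    "reasoning": "Medical storage question; low risk, benefits from guidance.",
    "replies": {
        "COMPLY": "Keep insulin between 2-8 C in an insulated pouch...",
        "GUIDE": "Here are general storage tips, but confirm with a pharmacist...",
        "REJECT": "I can't help with that request.",
    },
}

def to_cot_path_example(rec, policy):
    """Turn one distilled triple into one SFT example: the target is a full
    CoTPath trace whose response matches the policy-selected action, so the
    declared label and the actual reply can never disagree."""
    action = policy.get(rec["label"], "COMPLY")
    target = (
        f"<classify_user>{rec['label']}</classify_user>\n"
        f"<think>{rec['reasoning']}</think>\n"
        f"<act>{action}</act>\n"
        f"<response>{rec['replies'][action]}</response>"
    )
    return {"prompt": rec["query"], "completion": target}

example = to_cot_path_example(record, policy={"dosage_advice": "GUIDE"})
print(json.dumps(example, indent=2))
```

Selecting the reply from the pre-written trio at data-construction time, rather than letting the model free-generate it, is what enforces the label-response consistency the authors report.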
Experiments evaluate two dimensions: (1) core safety performance on five public benchmarks (Octopus‑Seval, Qwen‑Guard, Llama‑Guard, etc.) and (2) runtime controllability using a scenario‑based test suite (CoSA‑pien) that measures how well the model follows user‑defined policies. PACT matches or slightly exceeds the best‑performing baseline models (including 671 B parameter DeepSeek) on global safety metrics, while achieving the highest policy‑adherence scores among all tested systems. Notably, the over‑refusal rate drops by 27 % and the under‑constraint (missed‑risk) rate falls by 31 % compared with static‑policy baselines, demonstrating a substantial mitigation of the safety‑helpfulness trade‑off.
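The two headline error rates have simple definitions; a minimal sketch, assuming each evaluation record is just a (is_harmful, was_refused) pair rather than the benchmarks' actual schemas:

```python
def refusal_metrics(records):
    """records: iterable of (is_harmful, was_refused) booleans.
    Over-refusal  = fraction of benign queries that were refused.
    Under-constraint = fraction of harmful queries that were served."""
    benign = [refused for harmful, refused in records if not harmful]
    harmful = [refused for harmful, refused in records if harmful]
    over_refusal = sum(benign) / len(benign)
    under_constraint = sum(not r for r in harmful) / len(harmful)
    return over_refusal, under_constraint

# Toy evaluation set: two benign queries (one wrongly refused),
# two harmful queries (one wrongly served).
results = [(False, False), (False, True), (True, True), (True, False)]
over, under = refusal_metrics(results)  # both 0.5 on this toy set
```

The paper's reported 27% and 31% figures are relative reductions in these two rates versus static-policy baselines, not absolute values.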
The paper also discusses limitations. The global taxonomy is fixed; new emergent risks would require retraining the global component. The current action space is limited to three discrete modes, which may be insufficient for nuanced applications that need partial disclosures or multi‑step guidance. Moreover, while the self‑distilled data provides strong consistency, occasional label‑action mismatches could still arise, suggesting that RLHF‑based fine‑tuning would be beneficial.
Future directions include extending the action set (e.g., multi‑step GUIDE), enabling continual updates to the global risk taxonomy via meta‑learning, and integrating RLHF to enforce stricter label‑action alignment.
In summary, PACT offers a principled, transparent, and highly controllable approach to LLM safety alignment. By combining an immutable global safety layer with flexible, prompt‑driven user policies and an explicit chain‑of‑thought reasoning trace, it successfully reduces both over‑refusal and under‑constraint while preserving state‑of‑the‑art safety performance. This work paves the way for deploying LLMs in safety‑critical domains where both strict safeguards and domain‑specific utility are required.