Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Reading time: 5 minutes
...

📝 Original Info

  • Title: Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
  • ArXiv ID: 2601.00454
  • Date: 2026-01-01
  • Authors: Hyunjun Kim

📝 Abstract

Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from O(n²) to O(n) for n-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline, a 93× reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
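
As a quick sanity check on the figures quoted above, the sketch below reproduces the reported reductions from the abstract's own numbers and illustrates the intuition behind the O(n²) vs. O(n) claim: if a guardrail re-reads the full history at each of n turns of roughly constant length, total processed tokens grow quadratically, whereas a single compressed prompt grows only linearly. The per-turn token length used in the scaling example is an assumed illustration value, not a figure from the paper.

```python
# Sanity-check the reductions reported in the abstract and illustrate
# why full-history processing is O(n^2) in turns while M2S is O(n).

def training_token_reduction(baseline_tokens: float, m2s_tokens: float) -> float:
    """Return the reduction factor between two token budgets (e.g. ~93x)."""
    return baseline_tokens / m2s_tokens

def inference_token_savings(full_tokens: int, compressed_tokens: int) -> float:
    """Return the percentage of tokens saved by compression."""
    return 100.0 * (full_tokens - compressed_tokens) / full_tokens

# Figures quoted in the abstract.
print(training_token_reduction(15_700_000, 169_000))  # ~92.9 -> "93x reduction"
print(inference_token_savings(3_231, 173))            # ~94.6% fewer inference tokens

# Illustrative scaling argument (per-turn length c = 100 tokens is an assumption).
c, n = 100, 10
full_history_cost = sum(i * c for i in range(1, n + 1))  # re-read the prefix each turn: O(n^2)
m2s_cost = n * c                                          # one compressed prompt: O(n)
print(full_history_cost, m2s_cost)
```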

📄 Full Content

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Hyunjun Kim, KAIST (hyunjun1121@kaist.ac.kr)

Abstract

Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from O(n²) to O(n) for n-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline, a 93× reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their susceptibility to adversarial attacks remains a critical concern. Among these threats, multi-turn jailbreak attacks represent a particularly insidious category, where adversaries gradually manipulate LLMs through a series of carefully crafted conversational turns to bypass safety guardrails and elicit harmful outputs.

Guardrail models serve as a crucial defense mechanism, acting as classifiers that evaluate whether a given input-output pair is safe or unsafe. However, deploying these models for multi-turn conversations presents significant computational challenges: processing full conversation histories requires substantial token throughput, leading to increased latency and cost at inference time. As conversations grow longer, the computational burden scales linearly, making real-time safety screening increasingly expensive.

Recent work on Multi-turn to Single-turn (M2S) compression (Ha et al., 2025) has shown that multi-turn jailbreak attacks can be distilled into compact single-turn prompts that preserve their adversarial effectiveness. This insight, while concerning from a security perspective, suggests an intriguing defensive application: if the essential semantics of multi-turn attacks can be captured in compressed form, perhaps guardrail models can be trained to recognize these compressed representations directly.

In this paper, we propose Defensive M2S, a training paradigm that fine-tunes guardrail models on M2S-compressed conversation histories rather than full multi-turn dialogues. Our key hypothesis is that M2S compression maintains the semantic information necessary for accurate safety classification while dramatically reducing the computational cost of inference.
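
To make the compression step concrete, here is a minimal sketch of what the three M2S templates might produce. The exact template wording used by Ha et al. (2025) and in this paper is not reproduced here; the prompt strings below are illustrative assumptions.

```python
# Hedged sketch of M2S (multi-turn -> single-turn) compression templates.
# The precise prompt wording is assumed, not taken from the paper.

def hyphenize(user_turns: list[str]) -> str:
    """Compress user turns into a single hyphen-bulleted request."""
    bullets = "\n".join(f"- {t}" for t in user_turns)
    return f"Please answer the following points:\n{bullets}"

def numberize(user_turns: list[str]) -> str:
    """Compress user turns into a single numbered-list request."""
    items = "\n".join(f"{i}. {t}" for i, t in enumerate(user_turns, 1))
    return f"Please answer the following questions:\n{items}"

def pythonize(user_turns: list[str]) -> str:
    """Compress user turns into a Python-list style request."""
    items = ",\n    ".join(repr(t) for t in user_turns)
    return f"questions = [\n    {items}\n]\n# Answer each item in order."

turns = ["How do locks work?", "What tools open them?", "Walk me through it step by step."]
print(hyphenize(turns))
```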
We validate this hypothesis through extensive experiments on three guardrail model families (LlamaGuard, Nemotron, and Qwen3Guard) across multiple M2S compression templates (hyphenize, numberize, pythonize). Our evaluation on SafeDialBench (Chen et al., 2025), a comprehensive multi-turn jailbreak benchmark comprising 2,037 samples across 6 attack categories and 7 attack methods, reveals several key findings:

• Efficiency-Accuracy Trade-off: M2S-trained models achieve up to 94.6% token reduction while maintaining competitive detection accuracy. The best configuration (Qwen3Guard with hyphenize template) achieves 93.8% recall compared to 54.9% baseline recall, demonstrating that compression can actually improve detection performance for certain model-template combinations.

• Model-Template Sensitivity: The effectiveness of M2S training varies significantly across model-template combinations, with Qwen3Guard favoring hyphenize (93.8% recall) while Nemotron performs best with numberize (87.8% recall).

• Single-Template Superiority: Training on a single compression template outperforms mixed-template training, suggesting that template-specific representations provide stronger learning signals than diverse but inconsistent formats.

Our contributions can be summarized as follows:

1. We introduce Defensive M2S, a novel training paradigm that leverages adversarial compression techniques for efficient guardrail deployment.

2. We provide formal complexity analysis showing M2S reduces training cost from O(n²) to O(n).
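
The training recipe implied above (fine-tune a guardrail model on compressed conversations paired with safety labels) could be assembled roughly as follows. The record fields, the "safe"/"unsafe" label scheme, and the use of a hyphenize-style template are assumptions for illustration, not the paper's exact data format or fine-tuning setup.

```python
# Rough sketch of assembling Defensive M2S fine-tuning data:
# compress each multi-turn conversation into one prompt and keep its label.
# Field names and the "safe"/"unsafe" label scheme are assumptions.

def hyphenize(user_turns: list[str]) -> str:
    """Same hyphen-bullet compression sketched earlier."""
    return "Please answer the following points:\n" + "\n".join(f"- {t}" for t in user_turns)

def build_training_example(conversation: list[dict], label: str) -> dict:
    """Compress the user turns of one labeled conversation into a single-turn record."""
    user_turns = [m["content"] for m in conversation if m["role"] == "user"]
    return {"prompt": hyphenize(user_turns), "label": label}

# Tiny illustrative record; the paper's training conversations average ~10.6 turns.
conversation = [
    {"role": "user", "content": "Tell me about industrial chemistry."},
    {"role": "assistant", "content": "Sure, here is an overview..."},
    {"role": "user", "content": "Now give the restricted synthesis details."},
]
print(build_training_example(conversation, label="unsafe"))
```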

📸 Image Gallery

cover.png

Reference

This content is AI-processed based on open access ArXiv data.
