Title: Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
ArXiv ID: 2601.00454
Date: 2026-01-01
Authors: Hyunjun Kim
📝 Abstract
Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from O(n 2 ) to O(n) for n-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline-a 93× reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
💡 Deep Analysis
📄 Full Content
Defensive M2S: Training Guardrail Models on Compressed Multi-turn
Conversations
Hyunjun Kim
KAIST
hyunjun1121@kaist.ac.kr
Abstract
Guardrail models are essential for ensuring the
safety of Large Language Model (LLM) de-
ployments, but processing full multi-turn con-
versation histories incurs significant computa-
tional cost. We propose Defensive M2S, a train-
ing paradigm that fine-tunes guardrail models
on Multi-turn to Single-turn (M2S) compressed
conversations rather than complete dialogue
histories.
We provide a formal complexity analysis show-
ing that M2S reduces training cost from O(n2)
to O(n) for n-turn conversations.
Empiri-
cally, on our training dataset (779 samples, avg.
10.6 turns), M2S requires only 169K tokens
compared to 15.7M tokens for the multi-turn
baseline—a 93× reduction.
We evaluate Defensive M2S across three
guardrail
model
families
(LlamaGuard,
Nemotron, Qwen3Guard) and three com-
pression templates (hyphenize, numberize,
pythonize) on SafeDialBench, a comprehen-
sive multi-turn jailbreak benchmark. Our best
configuration, Qwen3Guard with hyphenize
compression, achieves 93.8% attack detection
recall while reducing inference tokens by
94.6% (from 3,231 to 173 tokens per con-
versation). This represents a 38.9 percentage
point improvement over the baseline while
dramatically
reducing
both
training
and
inference costs.
Our findings demonstrate that M2S compres-
sion can serve as an effective efficiency tech-
nique for guardrail deployment, enabling scal-
able safety screening of long multi-turn conver-
sations.
1
Introduction
Large Language Models (LLMs) have demon-
strated remarkable capabilities across diverse tasks,
but their susceptibility to adversarial attacks re-
mains a critical concern. Among these threats,
multi-turn jailbreak attacks represent a particularly
insidious category, where adversaries gradually ma-
nipulate LLMs through a series of carefully crafted
conversational turns to bypass safety guardrails and
elicit harmful outputs.
Guardrail models serve as a crucial defense
mechanism, acting as classifiers that evaluate
whether a given input-output pair is safe or unsafe.
However, deploying these models for multi-turn
conversations presents significant computational
challenges: processing full conversation histories
requires substantial token throughput, leading to in-
creased latency and cost at inference time. As con-
versations grow longer, the computational burden
scales linearly, making real-time safety screening
increasingly expensive.
Recent work on Multi-turn to Single-turn (M2S)
compression (Ha et al., 2025) has shown that multi-
turn jailbreak attacks can be distilled into compact
single-turn prompts that preserve their adversarial
effectiveness. This insight, while concerning from
a security perspective, suggests an intriguing defen-
sive application: if the essential semantics of multi-
turn attacks can be captured in compressed form,
perhaps guardrail models can be trained to recog-
nize these compressed representations directly.
In this paper, we propose Defensive M2S, a train-
ing paradigm that fine-tunes guardrail models on
M2S-compressed conversation histories rather than
full multi-turn dialogues. Our key hypothesis is
that M2S compression maintains the semantic in-
formation necessary for accurate safety classifica-
tion while dramatically reducing the computational
cost of inference.
We validate this hypothesis through extensive
experiments on three guardrail model families (Lla-
maGuard, Nemotron, and Qwen3Guard) across
multiple M2S compression templates (hyphenize,
numberize, pythonize). Our evaluation on Safe-
DialBench (Chen et al., 2025), a comprehensive
multi-turn jailbreak benchmark comprising 2,037
samples across 6 attack categories and 7 attack
1
arXiv:2601.00454v1 [cs.CL] 1 Jan 2026
methods, reveals several key findings:
• Efficiency-Accuracy
Trade-off:
M2S-
trained models achieve up to 94.6% token
reduction while maintaining competitive
detection accuracy. The best configuration
(Qwen3Guard with hyphenize template)
achieves 93.8% recall compared to 54.9%
baseline recall, demonstrating that com-
pression can actually improve detection
performance
for
certain
model-template
combinations.
• Model-Template Sensitivity: The effective-
ness of M2S training varies significantly
across model-template combinations, with
Qwen3Guard favoring hyphenize (93.8% re-
call) while Nemotron performs best with num-
berize (87.8% recall).
• Single-Template Superiority:
Training
on a single compression template outper-
forms mixed-template training, suggesting
that template-specific representations provide
stronger learning signals than diverse but in-
consistent formats.
Our contributions can be summarized as follows:
1. We introduce Defensive M2S, a novel training
paradigm that leverages adversarial compres-
sion techniques for efficient guardrail deploy-
ment.
2. We provide formal complexity analysis show-
ing M2S reduces training cost from O(n2) to
O(n)