Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs
Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior$-$commonly encouraged in multi-turn conversation settings$-$to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.
💡 Research Summary
Paper Overview
The authors introduce a novel cost‑amplification threat for conversational large language models (LLMs) called turn amplification. Unlike prior attacks that manipulate each user prompt to force a single, unusually long response, turn amplification biases the model’s dialogue dynamics so that it repeatedly asks clarification questions, thereby extending the number of interaction turns without completing the original task. This behavior inflates inference costs because, in multi‑turn deployments, the entire conversation history must be re‑processed at every turn, making the total token count (and thus compute) proportional to the number of turns.
Threat Model
An adversary does not need control over user inputs. Instead, the attacker can (i) embed a small amount of malicious fine‑tuning data (e.g., LoRA adapters) or (ii) corrupt a handful of model weights at runtime (bit‑flip attacks). Both approaches shift the model toward a universal activation subspace that promotes clarification‑seeking responses regardless of the specific query. The attack is therefore scalable, hard to detect, and works across prompts, tasks, and model sizes.
Mechanistic Insight
The key hypothesis is that clarification‑seeking is encoded in a query‑independent direction in the model’s residual‑stream activation space. To discover this direction, the authors generate a large synthetic dataset of 5,000 ten‑turn dialogues using a strong LLM (Qwen2.5‑32B) that is forced to avoid answering the original question and instead keep asking follow‑up clarifications. Because turn amplification is rarely exhibited by default, traditional difference‑of‑means (DIM) methods fail. Instead, they employ a gradient‑based optimization that directly learns a linear steering vector v separating “turn‑amplifying” from “non‑amplifying” activations. Adding v to the residual stream during inference (activation steering) reliably increases the probability of a clarification response.
Evaluation Framework
Since large‑scale human studies are impractical, the authors adopt an LLM‑as‑a‑Judge protocol. They use Qwen2.5‑32B as a judge to (1) decide after each turn whether the original user query has been fully answered, and (2) generate simulated user replies in two modes: Easy (cooperative) and Hard (pressuring the model to finish). The conversation proceeds until the judge signals completion or a maximum turn limit is reached. Metrics recorded are:
- Turns: number of assistant turns until completion,
- In‑Tokens / Out‑Tokens: cumulative token counts (proxy for compute cost),
- Accuracy: correctness of the final answer when ground truth exists.
Experimental Findings
Four instruction‑tuned LLMs (3 B–22 B parameters) are evaluated on two multi‑turn benchmarks. Results show:
- Up to 9.9× more turns,
- 200.1× increase in input tokens,
- 6.4× increase in output tokens,
while final answer accuracy remains largely unchanged (≈85‑90%).
Fine‑tuning with LoRA adapters covering only 0.03 % of parameters yields up to 9.2× more turns. Targeted bit‑flip attacks modifying just 25 weights achieve up to 4.6× turn inflation. These attacks require no per‑turn intervention after deployment.
Defenses
Standard defenses—output‑length caps, token‑anomaly detectors, and prompt‑level anomaly filters—are ineffective because the model’s responses remain well‑formed and policy‑compliant. The authors experiment with monitoring the proportion of clarification questions and applying activation regularization, but these preliminary mitigations only modestly reduce the effect and are not robust.
Implications and Future Work
The paper demonstrates that conversational LLMs can be steered at the representation level to inflate operational costs, revealing a previously unstudied failure mode that links internal activations, dialogue dynamics, and economic impact. Future directions include: (1) better characterization and interpretability of the universal activation subspace, (2) training objectives or regularizers that penalize unnecessary clarification behavior, and (3) deployment‑level monitoring tools that can detect abnormal turn growth in real time.
In summary, “turn amplification” expands the attack surface of conversational AI beyond content safety, highlighting the need for new security paradigms that consider multi‑turn dynamics and internal model representations.
Comments & Academic Discussion
Loading comments...
Leave a Comment