Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce Drift-Bench, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, Drift-Bench provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the RISE evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. Drift-Bench bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.


💡 Research Summary

Drift-Bench is introduced as the first diagnostic benchmark that evaluates the pragmatic robustness of large‑language‑model (LLM) agents when user inputs contain systematic faults. The authors argue that existing evaluations largely assume "oracle" inputs (clear, complete, and correct instructions) and therefore miss the execution risks that arise when real users provide ambiguous, incomplete, or erroneous commands. Drawing on Grice's Cooperative Principle, Austin's speech‑act theory, and Watzlawick's interactional axioms, the paper defines a unified taxonomy of four fault categories: flaw of intention, flaw of premise, flaw of parameter, and flaw of expression. Each category corresponds to a violation of a conversational maxim (relevance, quality, quantity, manner) and maps directly onto the kinds of misunderstandings that can cause an agent to take unsafe actions.
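The four-way taxonomy and its maxim pairing can be sketched as a simple lookup; the category-to-maxim mapping follows the summary above, while the enum names themselves are illustrative, not the paper's official identifiers.

```python
# Minimal sketch of the Drift-Bench fault taxonomy described above.
# Category/maxim pairing is from the summary; identifier names are assumed.
from enum import Enum

class FaultCategory(Enum):
    INTENTION = "flaw of intention"
    PREMISE = "flaw of premise"
    PARAMETER = "flaw of parameter"
    EXPRESSION = "flaw of expression"

# Each fault category violates one Gricean conversational maxim.
MAXIM_VIOLATED = {
    FaultCategory.INTENTION: "relevance",
    FaultCategory.PREMISE: "quality",
    FaultCategory.PARAMETER: "quantity",
    FaultCategory.EXPRESSION: "manner",
}

for fault, maxim in MAXIM_VIOLATED.items():
    print(f"{fault.value} -> maxim of {maxim}")
```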

To operationalize this taxonomy, the authors construct a dataset that spans two complementary execution environments. State‑oriented environments (e.g., local OS, databases) serve as “white‑box” settings where the agent can inspect internal state, while service‑oriented environments (e.g., external REST APIs) act as “black‑box” settings where the agent must rely on opaque request‑response cycles. Tasks are first filtered through an oracle pipeline: three moderately capable models must solve a task without any fault for it to be retained, guaranteeing that later failures are attributable to introduced faults rather than intrinsic task difficulty.
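The oracle filtering rule described above (keep a task only if all three oracle models solve its fault-free version) might look like the following; `solve` here is a stand-in predicate, since the real pipeline runs actual model evaluations.

```python
# Hypothetical sketch of oracle filtering: a task is retained only if every
# oracle model solves its fault-free version, so later failures can be
# attributed to injected faults rather than task difficulty.
def oracle_filter(tasks, oracle_models, solve):
    """Retain tasks that all oracle models solve without any fault."""
    return [t for t in tasks if all(solve(m, t) for m in oracle_models)]

# Toy usage: a model "solves" a task if its skill exceeds the difficulty.
tasks = [{"id": 1, "difficulty": 0.3}, {"id": 2, "difficulty": 0.9}]
models = [0.5, 0.6, 0.7]  # stand-in skill levels for three oracle models
kept = oracle_filter(tasks, models, lambda skill, t: t["difficulty"] < skill)
print([t["id"] for t in kept])  # task 2 is dropped: not every oracle solves it
```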

Fault injection proceeds in three stages: (1) semantic frame extraction, where LLMs parse each task into structured action types, required parameters, and expected outputs; (2) perturbation strategy generation, which creates controlled variations for each of the four fault types (e.g., swapping the user’s goal, falsifying a presupposition, omitting a required argument, or inserting lexical/syntactic ambiguity); and (3) perturbation injection, which applies these variations to produce a suite of faulty task variants while preserving the original ground‑truth description for evaluation.
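The three stages above can be sketched as a small pipeline. The frame fields and perturbation rules here are assumptions for illustration; in the benchmark itself, frame extraction and strategy generation are LLM-driven.

```python
# Illustrative sketch of the three-stage fault-injection pipeline:
# (1) semantic frame, (2) perturbation strategy, (3) injection.
import copy
from dataclasses import dataclass

@dataclass
class SemanticFrame:          # stage 1: structured parse of a task
    action: str
    parameters: dict
    expected_output: str

def inject_fault(frame, fault_type):
    """Stages 2-3: apply one controlled perturbation; the original frame
    is preserved as the ground truth for evaluation."""
    faulty = copy.deepcopy(frame)
    if fault_type == "flaw_of_parameter":    # omit a required argument
        faulty.parameters.pop(next(iter(faulty.parameters)))
    elif fault_type == "flaw_of_intention":  # swap the user's goal
        faulty.action = "unrelated_" + faulty.action
    elif fault_type == "flaw_of_premise":    # falsify a presupposition
        faulty.parameters["assumed_state"] = "nonexistent"
    elif fault_type == "flaw_of_expression": # insert ambiguity
        faulty.parameters = {k: f"{v} (or something like it)"
                             for k, v in faulty.parameters.items()}
    return frame, faulty  # (ground truth, faulty variant)

frame = SemanticFrame("filter_orders", {"date": "2024-01-01"}, "filtered list")
truth, variant = inject_fault(frame, "flaw_of_parameter")
print(variant.parameters)  # {} -- the required 'date' argument was removed
```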

On the agent side, the benchmark augments the traditional “command‑and‑execute” loop with five explicit clarification tools: Ask Parameter, Disambiguate, Propose Solution, Confirm Risk, and Report Blocker. When an agent detects uncertainty, it emits a structured clarification request (e.g., “Action: Clarify Strategy: Ask_Parameter Content: Which date would you like to filter the orders by?”). These requests are routed to a persona‑driven user simulator that implements five decision‑making styles—Rational, Intuitive, Dependent, Avoidant, and Spontaneous—derived from the General Decision‑Making Style framework. The simulator’s responses generate multi‑turn dialogues that mimic realistic human‑agent interactions.
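A minimal sketch of this clarification loop follows. The five tool names match the summary; the persona replies and routing logic are invented for illustration, since the actual simulator is LLM-driven.

```python
# Hypothetical sketch: the agent emits a structured clarification request,
# which is routed to a persona-conditioned user simulator.
from dataclasses import dataclass

CLARIFICATION_TOOLS = ["Ask_Parameter", "Disambiguate", "Propose_Solution",
                       "Confirm_Risk", "Report_Blocker"]

@dataclass
class ClarificationRequest:
    strategy: str  # must be one of CLARIFICATION_TOOLS
    content: str

def simulate_user(request, persona):
    """Toy persona-conditioned reply; the real simulator uses an LLM
    conditioned on the General Decision-Making Style personas."""
    assert request.strategy in CLARIFICATION_TOOLS
    replies = {  # illustrative canned answers per persona
        "Rational": "The cutoff date is 2024-01-01.",
        "Avoidant": "Whatever you think is best.",
    }
    return replies.get(persona, "Hmm, let me think about that.")

req = ClarificationRequest(
    "Ask_Parameter", "Which date would you like to filter the orders by?")
print(simulate_user(req, "Rational"))
```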

Evaluation follows the RISE protocol, which jointly measures (R)esult (task success), (I)nteraction (number of dialogue turns), (S)atisfaction (estimated user satisfaction), and (E)fficiency (clarification cost). Experiments with state‑of‑the‑art models (GPT‑4, Claude‑2, Llama‑2‑70B) reveal a dramatic performance drop of roughly 40 % when any fault is present. Notably, the authors discover a “Clarification Paradox”: in transparent white‑box environments, additional clarification turns often recover performance, whereas in opaque black‑box settings the same multi‑turn exchanges can overload context and further degrade success. Moreover, agents exhibit an “execution bias,” proceeding with high‑risk actions in about 70 % of fault cases instead of seeking clarification, highlighting a safety concern.
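A per-episode RISE record might be structured as below; the field ranges and the summary format are assumptions for illustration, not the paper's exact scoring formula.

```python
# Minimal sketch of a per-episode RISE record: Result, Interaction,
# Satisfaction, Efficiency. Ranges and aggregation are assumed, not official.
from dataclasses import dataclass

@dataclass
class RiseScore:
    result: float        # task success, assumed in [0, 1]
    interaction: int     # number of dialogue turns
    satisfaction: float  # estimated user satisfaction, assumed in [0, 1]
    efficiency: float    # clarification cost (lower is better)

    def summary(self):
        return {"R": self.result, "I": self.interaction,
                "S": self.satisfaction, "E": self.efficiency}

episode = RiseScore(result=1.0, interaction=4, satisfaction=0.8, efficiency=0.25)
print(episode.summary())
```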

The paper's contributions are threefold: (1) a theoretically grounded taxonomy of input faults that unifies prior ad‑hoc classifications; (2) the Drift-Bench benchmark itself, which couples controlled fault injection with multi‑turn clarification in both white‑box and black‑box execution contexts; and (3) the RISE evaluation framework, which quantifies both outcome quality and interaction efficiency. By exposing systematic cooperative breakdowns and linking them to downstream safety outcomes, Drift-Bench provides a reproducible platform for future research on robust, self‑clarifying LLM agents. The authors suggest extensions toward multimodal tools, long‑term memory management, and live user studies, positioning Drift-Bench as a foundational step toward safer, more reliable autonomous language agents.

