PrefIx: Understand and Adapt to User Preference in Human-Agent Interaction
LLM-based agents can complete tasks correctly yet still frustrate users through poor interaction patterns, such as excessive confirmations, opaque reasoning, or misaligned pacing. Current benchmarks evaluate task accuracy but overlook how agents interact: whether they infer preferences from implicit cues, adapt dynamically, or maintain fine-grained interaction quality. We introduce PrefIx, a configurable environment that evaluates both what agents accomplish and how they interact. Central to PrefIx is the Interaction-as-a-Tool (IaaT) paradigm, which treats interaction behaviors as structured tool calls, unifying them with existing evaluation frameworks. We define 31 preference settings across 14 attributes and formalize user experience (UX) as a core metric alongside task accuracy. A composite LLM-as-a-Judge mechanism across seven UX dimensions achieves strong aggregate reliability (ICC > 0.79), high internal consistency (alpha = 0.943), and human correlation (rho = 0.52-0.78). Preference-aware agents show a 7.6% average UX improvement and an 18.5% gain in preference alignment. Our work is openly accessible.
💡 Research Summary
The paper “PrefIx: Understand and Adapt to User Preference in Human‑Agent Interaction” addresses a critical gap in current evaluations of large‑language‑model (LLM) based agents. While existing benchmarks focus on task correctness, tool‑use accuracy, robustness, and safety, they largely ignore how agents interact with users—whether they infer implicit preferences, adapt their behavior dynamically, and maintain fine‑grained interaction quality. To fill this void, the authors introduce PrefIx, a configurable environment that simultaneously measures task performance and user experience (UX).
The core innovation is the Interaction‑as‑a‑Tool (IaaT) paradigm. Traditional evaluation treats tool calls as the only structured actions of an agent. IaaT expands the action space by representing interaction behaviors—such as confirmation requests, explanatory narratives, and pacing controls—as explicit tool calls. This unifies interaction and system tool usage under a single, measurable framework, allowing precise trajectory matching between generated and ground‑truth interaction sequences.
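The IaaT idea can be illustrated with a short sketch: if interaction behaviors are encoded as structured tool calls, a generated trajectory can be scored against a ground-truth trajectory with the same machinery used for system tool use. The class and function names below are hypothetical, not the paper's API.

```python
# Illustrative sketch of Interaction-as-a-Tool (IaaT): interaction behaviors
# (confirmations, explanations, pacing) become structured tool calls, so they
# can be trajectory-matched exactly like system tool calls.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str  # e.g. "ask_confirmation", "explain_reasoning", or "book_flight"
    kind: str  # "interaction" or "system"


def trajectory_match(generated, reference):
    """Fraction of reference calls reproduced in order (greedy in-order match)."""
    i = 0
    for call in generated:
        if i < len(reference) and call == reference[i]:
            i += 1
    return i / len(reference) if reference else 1.0


reference = [
    ToolCall("ask_confirmation", "interaction"),
    ToolCall("book_flight", "system"),
    ToolCall("explain_reasoning", "interaction"),
]
generated = [
    ToolCall("ask_confirmation", "interaction"),
    ToolCall("book_flight", "system"),
]
score = trajectory_match(generated, reference)  # 2 of 3 reference calls matched
```

The key design point is that interaction and system actions live in one action space, so a single matching metric covers both.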
PrefIx builds on the Berkeley Function Calling Leaderboard (BFCL) by first “coarsening” its highly specified prompts into higher‑level task instructions that preserve deterministic tool‑use ground truth while allowing flexible multi‑turn dialogues. This creates opportunities for agents to exhibit preference‑driven behavior without compromising task validity.
User simulation is another pillar. The authors define 14 interaction‑preference attributes across four dimensions: Transparency & Auditability, Interaction Pace & Flow, Strategy & Initiative, and Robustness & Adaptability. Each attribute has 2–3 concrete settings, yielding 31 distinct preference settings in total. Simulated users, implemented as LLMs, are assigned a preference configuration and are instructed to express their preferences implicitly through conversation style, avoiding explicit self‑disclosure. This design tests agents' ability to infer and adapt to realistic, subtle cues.
Evaluation is performed by a composite LLM‑as‑a‑Judge system that rates generated interaction trajectories on seven UX dimensions (e.g., satisfaction, efficiency, cognitive load, frustration). For each dimension the judge outputs a Likert score, a justification, and turn‑level evidence. The judges demonstrate strong internal consistency (Cronbach’s α = 0.943) and inter‑rater reliability (ICC > 0.79). Correlation with human annotators ranges from ρ = 0.52 to 0.78, indicating that the automated scores are meaningfully aligned with human judgment.
Experimental results compare a baseline BFCL‑style agent with a “preference‑aware” agent trained or prompted to infer user preferences. Task accuracy remains comparable, confirming that preference adaptation does not sacrifice correctness. However, the preference‑aware agent achieves an average 7.6% improvement in overall UX scores and an 18.5% increase in preference alignment metrics. These gains illustrate that agents can deliver the same functional outcomes while providing a smoother, less frustrating user experience.
The paper also surveys related benchmarks, showing that most lack multi‑turn support, dynamic preference handling, or interaction‑level evaluation. PrefIx uniquely integrates all three layers—understanding, planning/execution, and generation—into a single reproducible framework.
Limitations are acknowledged: the simulated users are themselves LLMs, which may not capture the full variability of human behavior, and the set of 31 preference settings, while extensive, is not exhaustive. Future work should expand the preference taxonomy, incorporate real‑user studies, and explore automated preference discovery beyond predefined categories.
In summary, PrefIx offers a novel, scalable, and reliable methodology for evaluating both “what” an LLM‑agent does and “how” it does it, positioning user experience as a first‑class metric in the development of human‑centered AI agents.