TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents
We address the problem of runtime trajectory anomaly detection, a critical capability for enabling trustworthy LLM agents. Current safety measures predominantly focus on static input/output filtering; we argue that ensuring LLM agent reliability also requires auditing the intermediate execution process. In this work, we formulate the task of Trajectory Anomaly Detection, whose goal is not merely detection but precise error localization, a capability essential for efficient rollback-and-retry. To this end, we construct TrajBench, a dataset synthesized via a perturb-and-complete strategy to cover diverse procedural anomalies. Using this benchmark, we investigate models' capability for process supervision and observe that general-purpose LLMs, when prompted zero-shot, struggle to identify and localize these anomalies. This reveals that generalized capabilities do not automatically translate to process reliability. We therefore propose TrajAD, a specialized verifier trained with fine-grained process supervision. Our approach outperforms baselines, demonstrating that specialized supervision is essential for building trustworthy agents.
💡 Research Summary
The paper tackles a previously under‑explored safety dimension of large language model (LLM) agents: the reliability of their execution trajectories. While most safety work focuses on static input‑output filtering or on improving the model’s overall capability, the authors argue that trustworthy agents must also be able to audit their intermediate reasoning‑action‑observation loops. To formalize this need, they introduce the task of Trajectory Anomaly Detection (TAD), which requires not only a binary decision (normal vs. anomalous) but also precise localization of the first erroneous step so that a rollback‑and‑retry mechanism can be invoked.
Three canonical anomaly categories are defined:
- Task Failure (A_fail) – either a reasoning flaw that leads to a wrong but syntactically valid action, or an outright execution error (e.g., invalid tool parameters).
- Process Inefficiency (A_ineff) – the agent reaches the correct final state but does so via redundant or looping steps, meaning a shorter trajectory exists.
- Unwarranted Continuation (A_unw) – the agent continues acting when the task is impossible (fails to refuse) or already completed (ignores a termination signal).
To train and evaluate detectors, the authors construct TrajBench, a large‑scale dataset of paired normal and anomalous trajectories. Starting from the high‑quality AgentBank corpus (covering reasoning, math, coding, web navigation, and embodied AI), they apply a Perturb‑and‑Complete pipeline: a target step is deliberately perturbed according to the taxonomy, then a strong LLM is prompted to complete the remainder of the trace while preserving logical consistency with the injected error. Because the perturbation point is known, precise error‑step labels are automatically generated. The final dataset contains over 60 k trajectories, balanced 1:1 between normal and anomalous, spanning 13 tasks and three anomaly types.
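The Perturb-and-Complete idea can be sketched as follows; `perturb_fn` and `complete_fn` are hypothetical stand-ins for the taxonomy-specific perturbation and the LLM-based completion described above, not the paper's actual implementation.

```python
import random

def perturb_and_complete(trajectory, perturb_fn, complete_fn):
    """Inject an anomaly at a random step, then regenerate the tail.

    perturb_fn(step)    -> a corrupted version of that step
    complete_fn(prefix) -> continuation steps logically consistent with the error
    Returns the anomalous trajectory and the error-step index, which is
    known for free because we chose the perturbation point ourselves.
    """
    t = random.randrange(len(trajectory))               # target step to perturb
    prefix = trajectory[:t] + [perturb_fn(trajectory[t])]
    completed = prefix + complete_fn(prefix)            # complete the remainder
    return completed, t
```

Pairing each output with its unperturbed source trajectory yields the 1:1 normal/anomalous balance reported for TrajBench.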
The core model, TrajAD, is a Transformer‑based verifier that ingests the full trajectory and jointly predicts the anomaly verdict and the index of the first error. Training uses a combined loss (binary cross‑entropy for the verdict and a position‑wise loss for the error index). Experiments compare zero‑shot prompting of several state‑of‑the‑art LLMs (GPT‑4, Claude, LLaMA‑2) against the fine‑tuned TrajAD. Zero‑shot models achieve modest detection accuracies (~55 %) and poor localization, confirming that generic capabilities do not transfer to process‑centric monitoring. In contrast, TrajAD reaches >89 % detection accuracy and >85 % localization accuracy, demonstrating the value of dedicated supervision.
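The combined objective can be illustrated in plain Python: binary cross-entropy on the verdict logit plus a softmax cross-entropy over step positions for the first-error index. This is a minimal sketch of the loss structure described above, assuming localization is supervised only on anomalous examples; the weighting and exact position-wise loss in the paper may differ.

```python
import math

def combined_loss(verdict_logit, verdict_label, step_logits, error_index, w=1.0):
    """BCE for the anomaly verdict + cross-entropy over steps for localization."""
    p = 1.0 / (1.0 + math.exp(-verdict_logit))          # sigmoid
    bce = -(verdict_label * math.log(p) + (1 - verdict_label) * math.log(1 - p))
    if verdict_label == 1:                              # localize only on anomalies
        m = max(step_logits)                            # log-sum-exp, stabilized
        log_z = m + math.log(sum(math.exp(s - m) for s in step_logits))
        ce = log_z - step_logits[error_index]
    else:
        ce = 0.0
    return bce + w * ce
```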
A practical rollback‑and‑retry protocol is demonstrated: when TrajAD flags an anomaly at step t, the agent halts before step t + 1, rolls back to state t – 1, and re‑executes from there, avoiding a full task restart. This yields substantial savings in computational resources and mitigates risks such as unintended database modifications.
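The protocol can be sketched as a control loop. The interfaces here (`agent_step`, `detector`, the retry budget) are hypothetical simplifications: the detector returns the first flagged step index or `None`, and on a flag the loop discards everything from that step onward and resumes, rather than restarting the task.

```python
def run_with_rollback(agent_step, detector, init_state, max_steps=20, max_retries=3):
    """Execute agent steps; on an anomaly at step t, roll back and retry.

    agent_step(state)  -> (action, new_state)
    detector(history)  -> index of first erroneous step, or None
    """
    states = [init_state]                 # states[i] is the state before step i
    history = []                          # actions taken so far
    retries = 0
    while len(history) < max_steps:
        action, new_state = agent_step(states[-1])
        history.append(action)
        states.append(new_state)
        t = detector(history)
        if t is not None:                 # anomaly first occurs at step t
            if retries >= max_retries:
                break                     # give up rather than loop forever
            retries += 1
            history = history[:t]         # discard step t and everything after
            states = states[:t + 1]       # roll back to the state before step t
    return states[-1], history
```

Because only the suffix from the flagged step is re-executed, the earlier (validated) prefix and its side effects are preserved.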
The paper acknowledges limitations: the synthetic nature of TrajBench may not capture all real‑world failure modes, and the verifier adds inference overhead. Future work is suggested on incorporating real execution logs, developing lightweight real‑time monitors, and exploring tighter integration between verifier and agent for self‑correction.
Overall, the work makes three key contributions: (1) formalizing trajectory‑level anomaly detection as a new safety benchmark, (2) releasing a high‑quality, richly annotated dataset for this purpose, and (3) presenting a specialized verifier that substantially outperforms generic LLMs, thereby advancing the trustworthiness of autonomous LLM agents.