AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents’ step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator’s assessment of new cases. Moreover, we develop ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing “Strict” and “Lenient” judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor.
💡 Research Summary
The paper tackles the pressing problem of reliably evaluating the safety and security of large‑language‑model (LLM)‑driven agents. Existing automated evaluators fall into two categories: rule‑based systems, which are fast and interpretable but brittle against implicit, ambiguous, or evolving threats; and LLM‑based judges, which can capture richer semantics but suffer from inconsistency, bias propagation, and limited explainability. Because LLM agents act step‑by‑step, invoking tools, interacting with environments, and making autonomous decisions, a simple content‑only assessment is insufficient. To bridge this gap, the authors introduce AgentAuditor, a training‑free, memory‑augmented reasoning framework that equips an LLM judge with human‑like experiential learning.
AgentAuditor operates in three stages. Stage 1 – Feature Memory Construction converts raw interaction logs into a structured representation. An LLM, prompted with a detailed “semantic extraction” instruction, extracts three key attributes: the application scenario (s), the observed risk type (r), and the agent’s behavior mode (b). These human‑readable triples are stored alongside vector embeddings of the raw text and each attribute, generated by the Nomic‑Embed‑Text‑v1.5 model. This dual representation (structured + vector) enables both interpretability and efficient similarity search.
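The dual structured + vector storage of Stage 1 can be sketched as follows. This is a toy illustration only: a deterministic hash-based function stands in for the Nomic‑Embed‑Text‑v1.5 model, and the (s, r, b) triple is hard-coded where the real pipeline would obtain it from an LLM prompted with the semantic-extraction instruction. All names and the example log are illustrative, not taken from the paper.

```python
import hashlib
import numpy as np

EMB_DIM = 16  # toy dimensionality; the paper uses Nomic-Embed-Text-v1.5 embeddings


def toy_embed(text: str) -> np.ndarray:
    """Deterministic stand-in for a real text-embedding model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)


def build_feature_record(raw_log: str, scenario: str, risk: str, behavior: str) -> dict:
    """Pair the human-readable (s, r, b) triple with embeddings of the raw
    text and of each attribute, mirroring the structured + vector storage."""
    return {
        "triple": {"scenario": scenario, "risk": risk, "behavior": behavior},
        "emb": {
            "content": toy_embed(raw_log),
            "scenario": toy_embed(scenario),
            "risk": toy_embed(risk),
            "behavior": toy_embed(behavior),
        },
    }


# In the real pipeline the triple is extracted by an LLM; here it is hard-coded.
record = build_feature_record(
    raw_log="User asks agent to email a report; agent attaches the wrong document.",
    scenario="email assistant",
    risk="information leakage",
    behavior="tool misuse",
)
print(record["triple"]["risk"])  # information leakage
```

The structured half of each record stays human-readable for inspection, while the embedding half supports the similarity search used in later stages.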
Stage 2 – Reasoning Memory Construction selects a compact set of representative experiences from the feature memory. After L2‑normalizing each embedding, the vectors are weighted (wc, ws, wr, wb), concatenated, and reduced in dimensionality via PCA. The reduced vectors are then clustered using FINCH, an unsupervised hierarchical algorithm that automatically discovers the intrinsic number of clusters, and approximately 10 % of the dataset (the “shots”) is chosen as exemplars. For each exemplar, a high‑quality chain‑of‑thought (CoT) explanation is generated by the same LLM using a predefined template, producing a reasoning memory that mimics expert case analyses.
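The normalize → weight → concatenate → PCA → cluster pipeline of Stage 2 can be sketched on random toy data. Two caveats: the weight values here are illustrative (the paper does not fix them in this summary), and a minimal k-means with medoid selection stands in for FINCH, which has no standard-library equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 40, 16  # toy corpus: 40 records, 16-dim embeddings per attribute
embs = {k: rng.standard_normal((N, D)) for k in ("content", "scenario", "risk", "behavior")}
weights = {"content": 1.0, "scenario": 0.5, "risk": 0.5, "behavior": 0.5}  # wc, ws, wr, wb (illustrative)

# 1) L2-normalize each embedding, 2) weight, 3) concatenate.
parts = []
for key in ("content", "scenario", "risk", "behavior"):
    e = embs[key]
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    parts.append(weights[key] * e)
X = np.concatenate(parts, axis=1)  # shape (N, 4*D)

# 4) PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k_dims = 8
Z = Xc @ Vt[:k_dims].T  # reduced (N, k_dims) representation

# 5) Cluster and pick exemplars ("shots", ~10% of the data).
# The paper uses FINCH; a tiny k-means serves as a stand-in here.
def kmeans(points, k, iters=20, seed=0):
    r = np.random.default_rng(seed)
    centers = points[r.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

n_shots = max(1, N // 10)
labels, centers = kmeans(Z, n_shots)
shot_idx = []
for j in range(n_shots):  # pick the medoid of each cluster as its exemplar
    members = np.where(labels == j)[0]
    if members.size:
        d = ((Z[members] - centers[j]) ** 2).sum(axis=1)
        shot_idx.append(int(members[np.argmin(d)]))
print(shot_idx)
```

Each selected exemplar would then be paired with an LLM-generated CoT explanation to form the reasoning memory.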
Stage 3 – Memory‑Augmented Reasoning handles a new evaluation query. The query’s embeddings are compared against the reasoning memory using cosine similarity; the top‑K most relevant CoT examples are retrieved, weighted, and merged with the query to form a final prompt. The LLM then produces a safety/security judgment, guided by concrete past reasoning traces. This retrieval‑augmented generation (RAG) loop allows the model to “recall” analogous cases, improving consistency and providing traceable explanations.
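The retrieval step of Stage 3 reduces to cosine-similarity top‑K search over the reasoning memory, followed by prompt assembly. The sketch below uses random placeholder embeddings and an invented prompt template; the paper's actual template and merging weights are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 16, 30
memory_embs = rng.standard_normal((M, D))            # embeddings of stored exemplars
memory_cots = [f"CoT trace #{i}" for i in range(M)]  # placeholder reasoning texts
query_emb = rng.standard_normal(D)                   # embedding of the new case


def top_k_cosine(query, keys, k=3):
    """Return indices (and scores) of the k memory entries most similar to the query."""
    q = query / np.linalg.norm(query)
    K = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = K @ q
    idx = np.argsort(sims)[::-1][:k]
    return idx, sims[idx]


idx, sims = top_k_cosine(query_emb, memory_embs, k=3)

# Merge the retrieved reasoning traces with the query into one prompt.
# This template is an assumption, not the paper's verbatim format.
prompt = "You are a safety/security judge for LLM agents.\n\n"
for rank, (i, s) in enumerate(zip(idx, sims), 1):
    prompt += f"[Example {rank}, similarity {s:.2f}]\n{memory_cots[i]}\n\n"
prompt += "[New case]\n<interaction log here>\nJudge: is the agent's behavior safe?"
print(len(idx))  # 3
```

Because the retrieved CoT traces appear verbatim in the prompt, the final judgment can be traced back to the analogous past cases that guided it.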
To benchmark the approach, the authors release ASSEBench, the first large‑scale dataset specifically designed for LLM‑based evaluators of agent safety and security. ASSEBench contains 2,293 meticulously annotated interaction records covering 15 risk types (e.g., tool misuse, biased output, information leakage), 29 application scenarios, 528 environments, and 26 behavior modes. Annotation follows a structured human‑computer collaborative protocol and includes both “strict” and “lenient” judgment standards to capture ambiguous risk assessments.
Extensive experiments across ASSEBench and several existing benchmarks (R‑Judge, ToolEmu, AgentSafetyBench, AgentSecurityBench) demonstrate that AgentAuditor consistently outperforms baselines. With Gemini‑2.0‑Flash‑Thinking, it achieves up to 96.3 % F1 and 96.1 % accuracy on R‑Judge, matching or surpassing human expert performance. Similar gains are observed with GPT‑4 and Claude, across both safety and security tasks. Ablation studies confirm that each component—structured feature extraction, FINCH‑based shot selection, and CoT‑augmented retrieval—contributes significantly to the overall improvement. Moreover, the generated CoT explanations enhance interpretability, addressing a key shortcoming of prior LLM judges.
In summary, AgentAuditor provides a universal, training‑free framework that equips LLM evaluators with a human‑like ability to learn from past cases, retrieve relevant experiences, and reason with explicit chain‑of‑thoughts. Combined with the newly introduced ASSEBench benchmark, the work sets a new state‑of‑the‑art for trustworthy evaluation of LLM agents, paving the way for scalable, interpretable, and human‑aligned safety and security assessment pipelines.