AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition


Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models in noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.


💡 Research Summary

AgentNoiseBench addresses a critical gap in the evaluation of large‑language‑model (LLM) based agents: the mismatch between idealized benchmark conditions and the noisy, stochastic environments in which these agents are deployed. The authors first conduct an empirical analysis of real‑world user‑agent interaction logs, identifying two dominant sources of perturbation: user‑noise (ambiguity, inconsistency, redundancy, topic drift, and boundary probing) and tool‑noise (execution failures, incomplete responses, erroneous outputs, misleading signals, and redundant information). These categories are grounded in observed frequencies and interaction costs, ensuring that the noise types reflect genuine deployment challenges.
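The two-level taxonomy above can be rendered as plain data. The category and type names follow the summary; the lookup helper is purely illustrative and not part of the paper's tooling.

```python
# Noise taxonomy as summarized above: two top-level categories,
# each with five observed noise types.
NOISE_TAXONOMY = {
    "user_noise": [
        "ambiguity",
        "inconsistency",
        "redundancy",
        "topic_drift",
        "boundary_probing",
    ],
    "tool_noise": [
        "execution_failure",
        "incomplete_response",
        "erroneous_output",
        "misleading_signal",
        "redundant_information",
    ],
}

def noise_category(noise_type: str) -> str:
    """Return the top-level category a given noise type belongs to."""
    for category, types in NOISE_TAXONOMY.items():
        if noise_type in types:
            return category
    raise KeyError(noise_type)
```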

To systematically inject such noise while preserving task solvability, the paper introduces a constrained adversarial noise generation pipeline. A fixed reference agent (A_ref) is used to guide the optimization of a noise generator G’s prompt parameters (θ). The objective maximizes performance degradation L_deg(A_ref, G(x;θ)) under the hard constraint that the perturbed input remains solvable (I_solvable = 1). The resulting optimal prompt θ* is frozen and applied uniformly to all evaluated agents, guaranteeing that every model faces the same, valid yet challenging perturbations. This design avoids the pitfall of making a task impossible, thereby attributing failures to model fragility rather than task invalidity.
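The constrained search over noise-generator parameters θ can be sketched as an argmax under a hard feasibility filter. Everything below is a toy stand-in for illustration: the solvability oracle, the degradation score, and the candidate perturbations are hypothetical, not the paper's actual generator or reference agent.

```python
def solvable(perturbed_task: str) -> bool:
    # Hypothetical solvability oracle I_solvable: here we just check that
    # the information needed to answer the task survives the perturbation.
    return "answer=" in perturbed_task

def degradation(ref_score_clean: float, ref_score_noisy: float) -> float:
    # L_deg: drop in the fixed reference agent A_ref's score under noise.
    return ref_score_clean - ref_score_noisy

def select_noise_prompt(task, candidates, ref_score_clean, ref_scores_noisy):
    """Pick theta* maximizing degradation subject to I_solvable = 1."""
    best_theta, best_deg = None, float("-inf")
    for theta, perturb in candidates.items():
        noisy_task = perturb(task)
        if not solvable(noisy_task):  # hard constraint: skip infeasible noise
            continue
        deg = degradation(ref_score_clean, ref_scores_noisy[theta])
        if deg > best_deg:
            best_theta, best_deg = theta, deg
    return best_theta

# Toy usage: three candidate perturbations, one of which destroys solvability
# and is therefore filtered out regardless of how much it hurts the agent.
task = "Look up the flight. answer=AA100"
candidates = {
    "redundant": lambda t: t + " (plus unrelated chatter)",
    "ambiguous": lambda t: "Maybe do something? " + t,
    "destructive": lambda t: t.replace("answer=AA100", ""),  # unsolvable
}
ref_scores_noisy = {"redundant": 0.7, "ambiguous": 0.5, "destructive": 0.1}
theta_star = select_noise_prompt(task, candidates, 0.9, ref_scores_noisy)
# theta_star -> "ambiguous": largest degradation among solvable candidates
```

Once θ* is found, it is frozen and reused for all evaluated agents, which is what makes cross-model comparisons fair.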

Beyond standard outcome‑based metrics, the authors propose a “Trajectory‑Aware Evaluation Protocol.” For each interaction, the full reasoning trajectory τ = (s₁,…,s_T) is recorded. A step‑wise validity indicator I_step(s_i, T) checks whether the agent’s behavior at step i conforms to the task specification despite noise. The overall trajectory validity I_traj = ∧_i I_step is then combined with the final answer correctness I_task to form a stability‑gated success criterion SGA(τ; T) = I_traj · I_task. This gating filters out cases where a correct final answer is obtained through a noisy‑induced detour, providing a more faithful measure of robustness.
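The stability-gated criterion SGA(τ; T) = I_traj · I_task reduces to a conjunction over step validities ANDed with final-answer correctness. A minimal sketch, assuming the step-wise indicators have already been computed upstream:

```python
def sga(step_validities, task_correct):
    """Stability-gated success: SGA = I_traj * I_task,
    where I_traj is the AND over per-step validity indicators I_step."""
    i_traj = all(step_validities)
    return int(i_traj and task_correct)

# A correct final answer reached through a noise-induced invalid step
# does not count as a robust success:
lucky = sga([True, False, True], True)   # gated out by the invalid step
robust = sga([True, True, True], True)   # valid trajectory + correct answer
```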

The benchmark is instantiated on three representative agent‑centric tasks: τ²‑Bench (tool‑use), VitaBench (search), and HotPotQA (complex QA). The authors evaluate 24 models spanning open‑source and proprietary families (DeepSeek, Gemini, GPT‑4.1, GPT‑5.2, Claude, Qwen, GLM, etc.), covering both “thinking” (chain‑of‑thought enabled) and non‑thinking variants. Results show a consistent performance drop under both user‑noise and tool‑noise, but the magnitude varies widely. Tool‑noise proves especially detrimental: most models suffer larger accuracy losses under tool‑noise than under user‑noise, likely because tool failures can abruptly break the reasoning chain. Thinking‑enabled models generally exhibit higher resilience, yet even the strongest models (e.g., GPT‑5.2, Claude‑4.5‑Sonnet) still experience notable degradation, indicating that current chain‑of‑thought mechanisms are not sufficient to fully mitigate environmental perturbations.

Crucially, the study finds a weak correlation between a model’s baseline reasoning ability (measured on clean benchmarks) and its robustness to noise, suggesting that traditional evaluation pipelines may overestimate real‑world reliability. The paper also analyses step‑wise entropy of trajectories, revealing that user‑noise tends to increase uncertainty early in the interaction, while tool‑noise spikes entropy later when the agent relies on external tool outputs.
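The step-wise entropy analysis can be illustrated with plain Shannon entropy over a per-step action distribution. How the paper actually derives those distributions is not specified in this summary; the function below just shows the metric itself.

```python
import math

def step_entropy(prob_dist):
    """Shannon entropy (in bits) of one step's action distribution.
    Higher values indicate greater uncertainty at that step."""
    return -sum(p * math.log2(p) for p in prob_dist if p > 0)

# User-noise-style early uncertainty: a near-uniform choice among 4 actions
early = step_entropy([0.25, 0.25, 0.25, 0.25])  # 2.0 bits, the maximum for 4
# A confident later step, before any tool-noise spike
late = step_entropy([0.97, 0.01, 0.01, 0.01])
```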

AgentNoiseBench contributes a reproducible, open‑source pipeline (code released at https://github.com/keven-cyber/agentnoisebench) that standardizes noise taxonomy, constrained adversarial injection, and trajectory‑aware evaluation. By making noise‑robustness a first‑class evaluation criterion, the work paves the way for developing LLM agents that can maintain reliable performance in the messy, unpredictable conditions of real‑world deployments.
