The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution


Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. As these systems become more autonomous and are deployed at scale, understanding why an agent takes a particular action becomes increasingly important for accountability and governance. However, existing research predominantly focuses on *failure attribution* to localize explicit errors in unsuccessful trajectories, which is insufficient for explaining **the reasons behind agent behaviors**. To bridge this gap, we propose a novel framework for **general agentic attribution**, designed to identify the internal factors driving agent actions regardless of the task outcome. Our framework operates hierarchically to manage the complexity of agent interactions. Specifically, at the *component level*, we employ temporal likelihood dynamics to identify critical interaction steps; then, at the *sentence level*, we refine this localization using perturbation-based analysis to isolate the specific textual evidence. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks such as memory-induced bias. Experimental results demonstrate that the proposed framework reliably pinpoints the pivotal historical events and sentences behind agent behavior, offering a critical step toward safer and more accountable agentic systems. Code is available at https://github.com/AI45Lab/AgentDoG.


💡 Research Summary

The paper addresses a critical gap in the interpretability of large‑language‑model (LLM) based agents: existing work focuses almost exclusively on “failure attribution,” i.e., locating explicit errors in trajectories that end in a failure. This approach cannot explain why an agent takes a particular action when the outcome is successful or otherwise acceptable, yet the decision process may be questionable, biased, or misaligned. To fill this void, the authors propose a novel “agentic attribution” framework that identifies internal drivers of any target action, regardless of the overall task result.

The framework operates hierarchically. At the component level, the agent's entire interaction history is broken down into an ordered sequence of components, where each component corresponds to a single observation, tool output, memory retrieval, internal thought, or user utterance. For each prefix of this sequence, the model's log-likelihood of the target action a_T is computed: ψ_i = log p_{π_θ}(a_T | C_{≤i}). The temporal gain g_i = ψ_i − ψ_{i−1} quantifies how much the newly introduced component shifts the probability of the realized action. Components with large positive gains are flagged as decision drivers. This "temporal likelihood dynamics" method respects the causal order of tool calls, memory updates, and reasoning steps, unlike flat-context attribution methods developed for retrieval-augmented generation (RAG) settings.
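The component-level computation can be sketched as follows. This is a minimal illustration, not the authors' implementation: `log_likelihood(context, action)` is a hypothetical callable standing in for log p_{π_θ}(a_T | C_{≤i}) under the agent's policy model, which in practice would be computed from the LLM's token log-probabilities.

```python
from typing import Callable, List


def temporal_gains(
    components: List[str],
    target_action: str,
    log_likelihood: Callable[[str, str], float],
) -> List[float]:
    """Return the temporal gain g_i = psi_i - psi_{i-1} for each component,
    where psi_i = log_likelihood(history up to component i, target_action).

    `log_likelihood` is an assumed interface onto the agent's policy model.
    """
    gains: List[float] = []
    context = ""
    prev_psi = log_likelihood(context, target_action)  # empty-history baseline
    for comp in components:
        context = context + comp + "\n"       # extend the prefix in causal order
        psi = log_likelihood(context, target_action)
        gains.append(psi - prev_psi)          # how much this component shifted psi
        prev_psi = psi
    return gains
```

The component whose gain is largest (and positive) is then flagged as the primary decision driver; because prefixes are extended in order, the score respects the causal structure of the trajectory.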

At the sentence level, the framework refines the attribution within each high-impact component. The component is split into sentences s_{i,1}, …, s_{i,N_i}. For each sentence, a perturbation-based ablation is performed: the sentence is removed, the model recomputes p(a_T), and the drop in probability is recorded as the sentence's attribution score. This causal perturbation directly measures the contribution of individual pieces of textual evidence. The authors note that alternative fine-grained methods (gradient-based saliency, attention weights) can be plugged into the same pipeline, but their experiments show that perturbation yields the most stable and interpretable results.
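The sentence-level ablation can be sketched as below, under the same assumed `log_likelihood(context, action)` interface as before (a hypothetical stand-in for the policy model's log-probability of the target action) and a deliberately naive period-based sentence splitter:

```python
from typing import Callable, List


def sentence_attribution(
    components: List[str],
    comp_index: int,
    target_action: str,
    log_likelihood: Callable[[str, str], float],
) -> List[float]:
    """Score each sentence of components[comp_index] by the drop in
    log p(a_T | history) when that sentence alone is removed.

    Higher score = removing the sentence hurts the action's likelihood more,
    i.e. the sentence contributed more to the decision.
    """
    # naive splitter for illustration; a real pipeline would use a proper one
    sentences = [s.strip() for s in components[comp_index].split(".") if s.strip()]
    base = log_likelihood("\n".join(components), target_action)
    scores: List[float] = []
    for j in range(len(sentences)):
        kept = ". ".join(s for k, s in enumerate(sentences) if k != j)
        ablated = components[:comp_index] + [kept] + components[comp_index + 1:]
        scores.append(base - log_likelihood("\n".join(ablated), target_action))
    return scores
```

In the paper's memory-bias example, this style of ablation is what isolates the specific memory snippet as the evidence driving the unwarranted refund.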

The authors evaluate the framework on Llama‑3.1‑70B‑Instruct across three carefully constructed agentic scenarios:

  1. Standard tool‑use tasks – where the agent must retrieve web information, execute code, and synthesize a response. Temporal gains correctly highlight the tool output that makes the final answer possible.
  2. Memory‑induced bias – a customer‑service example where the agent issues an unwarranted refund because a high‑success‑rate “refund” memory entry dominates the decision. Both component and sentence attribution pinpoint the exact memory snippet (“refund action with 99.5% success”) as the cause.
  3. Tool‑conditioned hallucination – the agent generates a plausible but false claim based on a noisy tool result. The framework isolates the misleading tool observation as the decisive component.

Qualitative analysis demonstrates that the proposed method recovers the same causal factors that human experts identify, while traditional failure‑centric methods remain silent on these “successful‑yet‑problematic” cases. The paper also compares three sentence‑level attribution techniques; perturbation consistently outperforms the others in both precision and robustness.

Key contributions are:

  • A generalizable agentic attribution problem formulation that quantifies the influence of historical components and individual sentences on any target action.
  • Temporal likelihood dynamics as a principled, model‑agnostic metric for component‑level influence, preserving causal order.
  • Perturbation‑based sentence attribution that yields fine‑grained, causal explanations of textual evidence.
  • Extensive empirical validation on diverse scenarios, showing the framework’s ability to uncover hidden biases, tool‑related errors, and other reliability risks.
  • Open‑source release of code and benchmark data, facilitating reproducibility and future extensions.

The authors acknowledge limitations: computing log‑likelihoods requires access to the underlying LLM (black‑box APIs may not expose this), the approach can be computationally expensive for long trajectories, and sentence ablation may disrupt context in ways that over‑estimate influence. Future work is suggested on efficient approximations, multimodal evidence attribution, and integration with safety‑oriented reward models to automatically flag risky decisions.

Overall, the paper makes a substantial step toward transparent, accountable LLM agents by moving beyond “what went wrong” to “why did this happen,” offering tools that could be vital for governance, debugging, and trustworthy deployment of autonomous AI systems.

