TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code
Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.
💡 Research Summary
TraceCoder addresses a critical gap in the emerging field of automated repair for code generated by large language models (LLMs). While LLMs excel at producing syntactically correct programs, they frequently introduce subtle logical bugs that are hard to detect with simple pass/fail test feedback. Existing self‑debugging approaches treat the program as a black box, relying solely on binary test outcomes and lacking mechanisms to learn from past failures. TraceCoder proposes a human‑inspired, multi‑agent framework that decomposes the debugging workflow into three distinct stages: instrumentation, analysis, and repair, each embodied by a specialized agent.
The Instrumentation Agent first observes a failing execution and automatically injects lightweight diagnostic probes—such as print statements or variable logging—into the source code. Probe placement is guided by a lightweight reasoning process that considers the recent error message, control‑flow structure, and prior instrumentation suggestions. The resulting instrumented code preserves the original semantics while emitting fine‑grained runtime traces that capture entry/exit of loops, key variable values, and branch decisions.
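The paper does not give the Instrumentation Agent's implementation; as a minimal sketch of the idea, the following assumes probes are plain print statements inserted via an AST rewrite (here, one probe per `for` loop), so the instrumented code keeps its original semantics while emitting a runtime trace. The `ProbeInjector` class and `instrument` helper are illustrative names, not part of TraceCoder's released API.

```python
import ast
import contextlib
import io

class ProbeInjector(ast.NodeTransformer):
    # Illustrative stand-in for the Instrumentation Agent: emit a probe
    # (a print statement) immediately before every for-loop so the trace
    # records loop entry and the loop's source line.
    def visit_For(self, node: ast.For) -> list:
        self.generic_visit(node)
        probe = ast.parse(
            f"print('TRACE: entering loop at line {node.lineno}')"
        ).body[0]
        return [probe, node]

def instrument(source: str) -> str:
    """Return a semantically equivalent version of `source` with probes added."""
    tree = ProbeInjector().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # Python 3.9+

# Toy failing program standing in for LLM-generated code.
buggy = """
total = 0
for i in range(3):
    total += i
print('total =', total)
"""

# Run the instrumented code and capture the fine-grained trace.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(compile(instrument(buggy), "<instrumented>", "exec"), {})
trace = buf.getvalue()
```

In a full system the probes would also log variable values and branch decisions, and their placement would be chosen by the agent's reasoning over the error message rather than applied uniformly.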
These traces are consumed by the Analysis Agent, which performs causal reasoning to pinpoint the root cause of the failure. The agent builds a causal graph from the trace data, correlates variable changes with observed assertion failures, and applies rule‑based heuristics to isolate the most suspicious statements. A novel Historical Lesson Learning Mechanism (HLLM) augments this process: it maintains a repository of “error‑repair‑outcome” triples from all previous debugging sessions. When the current trace exhibits patterns similar to past cases, the HLLM surfaces relevant lessons, biasing the causal inference toward proven repair strategies and preventing the system from repeating the same mistakes.
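The paper does not specify how the HLLM matches the current failure against its repository; one plausible sketch, assuming the "error-repair-outcome" triples are keyed by normalized error messages and matched by string similarity, looks like this. The `Lesson` record, the sample repository contents, and the `surface_lessons` helper are all hypothetical.

```python
import difflib
from dataclasses import dataclass

@dataclass
class Lesson:
    error_signature: str  # normalized error message from a past session
    repair: str           # repair strategy attempted for that error
    succeeded: bool       # outcome: did the repair fix the bug?

# Hypothetical repository of "error-repair-outcome" triples.
repository = [
    Lesson("IndexError: list index out of range",
           "check loop bound against len(xs)", True),
    Lesson("IndexError: list index out of range",
           "wrap body in try/except", False),
    Lesson("KeyError: 'name'", "use dict.get with a default", True),
]

def surface_lessons(current_error: str, min_similarity: float = 0.6) -> list:
    """Return past lessons whose error signatures resemble the current
    failure, ranking proven (successful) repairs first."""
    scored = []
    for lesson in repository:
        sim = difflib.SequenceMatcher(
            None, current_error, lesson.error_signature
        ).ratio()
        if sim >= min_similarity:
            scored.append((sim, lesson))
    # Successful repairs first, then by similarity.
    scored.sort(key=lambda pair: (pair[1].succeeded, pair[0]), reverse=True)
    return [lesson for _, lesson in scored]

hits = surface_lessons("IndexError: list index out of range in solve()")
```

The ranking biases the Analysis Agent toward strategies that worked before, which is the behavior the HLLM is described as providing; a real implementation would likely use embedding-based similarity over richer trace features rather than raw string matching.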
Guided by the analysis, the Repair Agent formulates a concrete repair plan and translates it into a prompt for the underlying LLM. The LLM synthesizes a patched version of the code, which is immediately re‑executed against the test suite. If the patch does not improve on the previous attempt, a Rollback Mechanism (RM) restores the best version seen so far and triggers a new iteration of the loop. The RM enforces a strict improvement condition: each accepted cycle must either increase the number of passed tests or reduce the severity of failures, thereby guaranteeing convergence and avoiding the degradation loops observed in naïve execution‑feedback systems.
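The rollback logic above can be sketched as a small loop. This is a simplified illustration, not TraceCoder's implementation: `propose_patch` stands in for the Repair Agent's LLM call, the strict-improvement check is reduced to "passes strictly more tests", and the toy tests `exec` a candidate source string.

```python
def repair_loop(initial_code, tests, propose_patch, max_iters=5):
    """Rollback Mechanism sketch: keep a candidate patch only if it passes
    strictly more tests than the best version so far; otherwise roll back."""
    def score(code):
        return sum(1 for t in tests if t(code))

    best_code, best_score = initial_code, score(initial_code)
    for _ in range(max_iters):
        if best_score == len(tests):
            break  # all tests pass; done
        candidate = propose_patch(best_code)
        if score(candidate) > best_score:  # strict improvement condition
            best_code, best_score = candidate, score(candidate)
        # else: roll back, i.e. keep best_code for the next iteration
    return best_code, best_score

def make_test(inp, expected):
    """Build a pass/fail test that runs a candidate source string."""
    def t(code):
        ns = {}
        try:
            exec(code, ns)
            return ns["add_one"](inp) == expected
        except Exception:
            return False
    return t

tests = [make_test(1, 2), make_test(5, 6)]
buggy = "def add_one(x):\n    return x + 2\n"
# Scripted "LLM" patches: the first is worse (rolled back), the second fixes the bug.
patches = iter([
    "def add_one(x):\n    return x - 1\n",
    "def add_one(x):\n    return x + 1\n",
])
fixed, passed = repair_loop(buggy, tests, lambda _code: next(patches))
```

Because the first patch passes no more tests than the original, it is discarded and the second patch is proposed from the unchanged best version, mirroring how the RM prevents degradation across iterations.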
The authors evaluate TraceCoder on three widely used benchmarks—BigCodeBench, ClassEval, and HumanEval+—using multiple LLM back‑ends (e.g., GPT‑4, Claude‑2). Compared with strong baselines such as Self‑Debugging and INTERVENOR, TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy. Ablation studies reveal that the iterative repair loop alone contributes a 65.61% relative gain, underscoring the importance of fine‑grained tracing and causal analysis. Moreover, the system reduces redundant repair attempts and improves cost‑efficiency, especially on complex class‑level tasks where LLMs are most error‑prone.
Beyond performance numbers, the paper highlights the modularity and interpretability of the multi‑agent design. Each agent produces explicit logs and rationales, enabling developers and researchers to audit the debugging process, extend individual components, or replace the LLM core without redesigning the entire pipeline. The authors release an open‑source implementation to foster reproducibility and encourage further exploration of history‑aware, trace‑driven automated debugging. In sum, TraceCoder demonstrates that integrating runtime tracing, causal reasoning, and learned debugging experience into a coordinated multi‑agent system can substantially elevate the reliability of LLM‑generated code, moving the field closer to truly self‑sufficient programming assistants.