Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry, but ultimately these patches still need to be reviewed by a human before being committed, to ensure they actually address the bug. Showing patches unlikely to be accepted can lead to substantial noise, wasting valuable developer time and eroding trust in automated code changes. We introduce two complementary LLM-based policies to reduce such noise: bug abstention and patch validation policies. Bug abstention excludes bugs that the agentic APR system is unlikely to fix. Patch validation rejects patches that are unlikely to be a good fix for the given bug. We evaluate both policies on three sets of bugs from Google’s codebase, and their candidate patches generated by an internal agentic APR system. On a set of 174 human-reported bugs, removing bugs and patches rejected by our policies can raise success rates by up to 13 percentage points and 15 percentage points, respectively, and by up to 39 percentage points in combination. On null pointer exceptions and sanitizer-reported bugs with machine-generated bug reports, patch validation also improves average single-sample success rates. This two-policy approach provides a practical path to the reliable, industrial-scale deployment of agentic APR systems.
💡 Research Summary
The paper tackles a practical obstacle in deploying agentic Automated Program Repair (APR) systems at industrial scale: the “noise” generated when developers are presented with patches that are unlikely to be correct or useful. While prior APR research has focused on overall success metrics such as pass@k, real‑world adoption hinges on the quality of the subset of patches that actually reach a human reviewer. To address this, the authors propose a two‑stage filtering framework built around large language models (LLMs).
1. Bug Abstention Policy
The first stage decides, before any repair attempt, whether a given bug is worth attempting to fix. The policy receives only the bug report (title, description, and metadata) and queries an LLM with an instruction‑style prompt that asks the model to output a binary “success” or “failure” token. The model’s token‑level log probabilities are interpreted as a probability score Pₐ(b). By comparing Pₐ(b) to a configurable threshold τ, the system either proceeds with the repair agent (Attempt Repair) or skips the bug entirely (Abstain). Two prompt variants are explored: a plain representation and one enriched with manually crafted guidelines (e.g., “Clear problem & action”, “Precise code localization”). Importantly, the abstention model does not have access to the codebase, keeping it lightweight and avoiding unnecessary compute.
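The abstention decision described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`abstention_score`, `should_attempt_repair`) and the threshold value are hypothetical, and it assumes the LLM call returns token-level log probabilities for the "success" and "failure" tokens.

```python
import math

TAU = 0.5  # configurable abstention threshold τ (illustrative value)

def abstention_score(logprob_success: float, logprob_failure: float) -> float:
    """Normalize the log probabilities of the 'success' and 'failure'
    tokens into a probability score P_a(b) in [0, 1]."""
    p_success = math.exp(logprob_success)
    p_failure = math.exp(logprob_failure)
    return p_success / (p_success + p_failure)

def should_attempt_repair(logprob_success: float, logprob_failure: float) -> bool:
    """Attempt repair only if P_a(b) clears the threshold τ;
    otherwise abstain and skip the bug entirely."""
    return abstention_score(logprob_success, logprob_failure) >= TAU
```

Because only the bug report is scored, this check costs a single lightweight LLM call per bug, which is what makes it viable as a pre-filter at scale.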
2. Patch Validation Policy
If a repair attempt produces a candidate patch, the second stage evaluates its correctness. Validation proceeds in three layers: (a) deterministic build and test regression checks that discard patches failing known compilation or reproduction tests; (b) a “Fix Specification” generation step where an LLM, given the bug report and the original source files edited by the agent, produces a natural‑language specification of what a correct fix should look like; (c) a second LLM call that consumes the specification, the unified diff of the candidate patch, and any test execution logs, and returns a triplet (Pᵥ, explanation, confidence). The binary judgment Pᵥ indicates whether the patch is likely correct; the confidence score is derived from the exponential of average token log probabilities, while the explanation offers human‑readable rationale. A variant that skips the specification and directly feeds the entire agent dialogue into the LLM is also evaluated.
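The three validation layers compose into a short-circuiting pipeline, and the confidence score has a simple closed form (the exponential of the mean token log probability, i.e. a geometric mean of per-token probabilities). The sketch below is an assumed structure, not the paper's code: the callables standing in for the build/test checks and the two LLM calls are hypothetical placeholders.

```python
import math
from typing import Callable, List, Optional, Tuple

def judgment_confidence(token_logprobs: List[float]) -> float:
    """Confidence = exp of the average token log probability of the
    validator's judgment, yielding a value in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def validate_patch(
    patch_diff: str,
    bug_report: str,
    builds_and_passes: Callable[[str], bool],          # layer (a): deterministic checks
    generate_spec: Callable[[str], str],               # layer (b): LLM fix specification
    judge: Callable[[str, str], Tuple[bool, str, List[float]]],  # layer (c): LLM judgment
) -> Optional[dict]:
    """Run the three validation layers in order, rejecting early."""
    if not builds_and_passes(patch_diff):
        return None  # discard patches failing compilation or reproduction tests
    spec = generate_spec(bug_report)
    is_correct, explanation, logprobs = judge(spec, patch_diff)
    return {
        "accept": is_correct,               # binary judgment P_v
        "explanation": explanation,          # human-readable rationale
        "confidence": judgment_confidence(logprobs),
    }
```

Ordering the cheap deterministic checks first means the two LLM calls are only spent on patches that at least compile and pass the known tests.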
3. Experimental Evaluation
The authors evaluate both policies on three real‑world bug sets from Google’s codebase:
- Human‑reported bugs (174) with known ground‑truth patches and fail‑to‑pass test cases.
- Machine‑reported Java Null‑Pointer Exceptions (198) collected from a live deployment.
- Sanitizer‑reported bugs (50) with reproducible tests.
Metrics focus on “filtered success@k”, i.e., success rates computed only on the subset of bugs/patches that survive the policies. For the human‑reported set, baseline fail‑to‑pass@1 is 0.11. Applying bug abstention alone raises it to 0.21; patch validation alone to 0.29. When combined, a moderate configuration yields 0.35, while a stricter funnel reaches 0.53—more than a 1‑in‑2 chance of presenting a correct patch, albeit for fewer bugs. For the NPE set, patch validation improves accept@1 from 0.38 to 0.62; for sanitizer bugs, gains of up to 15 percentage points are observed.
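To make the "filtered success@k" metric concrete, here is a minimal sketch of the k=1 case: success is computed only over the bugs/patches that survive the policies. The function name and the data in the usage example are made up for illustration; they are not the paper's numbers.

```python
def filtered_success_at_1(outcomes, kept):
    """outcomes[i]: True if the patch for bug i is correct.
    kept[i]: True if bug i survived abstention/validation.
    Returns the success rate measured only over survivors."""
    survivors = [ok for ok, keep in zip(outcomes, kept) if keep]
    if not survivors:
        return 0.0
    return sum(survivors) / len(survivors)

# Toy example: 5 bugs, the filters keep 3, of which 2 are fixed correctly.
outcomes = [True, False, False, True, False]
kept     = [True, False, True,  True, False]
rate = filtered_success_at_1(outcomes, kept)  # 2 of 3 survivors ≈ 0.67
```

Note the trade-off visible in the paper's numbers: stricter filtering raises the rate among surviving bugs but shrinks the set of bugs for which a patch is shown at all.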
4. Insights and Contributions
- The two policies are complementary: abstention filters out inherently hard bugs, reducing wasted repair attempts, while validation weeds out low‑quality patches that slip through.
- Even without ground‑truth fixes, an LLM can generate a useful “fix specification” that guides downstream judgment.
- The framework dramatically improves the developer‑facing success rate without requiring any additional test or oracle information beyond what is already available in the bug report.
- The approach demonstrates a shift in how LLMs are used in APR: not merely as code generators, but as decision‑making and quality‑control agents.
5. Limitations and Future Work
- LLM‑derived probability scores are not perfectly calibrated to empirical success rates; however, ranking quality suffices for abstention.
- The fix specification may be erroneous, potentially leading the validator to accept a bad patch; mechanisms for specification verification or human‑in‑the‑loop refinement are needed.
- Experiments are confined to Google’s internal codebase and a specific LLM; broader generalization to other languages, domains, or open‑source repositories remains an open question.
- Future directions include probability calibration techniques, richer multi‑modal representations (e.g., incorporating static analysis), and exploring how human feedback can iteratively improve both policies.
6. Conclusion
By introducing a pre‑repair bug‑abstention filter and a multi‑stage LLM‑driven patch validator, the authors provide a practical, scalable solution to reduce noise in agentic APR systems. Their extensive industrial evaluation shows that the combined approach can raise the filtered success rate by up to 39 percentage points, making automated repair tools far more trustworthy and efficient for developers. This work paves the way for treating LLMs as both creators and judges of code changes, a paradigm shift that could accelerate the adoption of AI‑assisted software maintenance in real‑world settings.