SVRepair: Structured Visual Reasoning for Automated Program Repair
Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR), yet most existing approaches remain unimodal and fail to leverage the rich diagnostic signals contained in visual artifacts such as screenshots and control-flow graphs. In practice, many bug reports convey critical information visually (e.g., layout breakage or missing widgets), but directly feeding such dense visual inputs to multimodal LLMs (MLLMs) often causes context loss and noise, making it difficult to ground visual observations into precise fault localization and executable patches. To bridge this semantic gap, we propose **SVRepair**, a multimodal APR framework built on structured visual representation. SVRepair first fine-tunes a vision-language model, **Structured Visual Representation (SVR)**, to uniformly transform heterogeneous visual artifacts into a *semantic scene graph* that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. Building on the graph, SVRepair drives a coding agent to localize faults and synthesize patches, and further introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-centered regions, suppressing irrelevant context and reducing hallucinations. Extensive experiments across multiple benchmarks demonstrate state-of-the-art performance: SVRepair achieves **36.47%** accuracy on SWE-Bench M, **38.02%** on MMCode, and **95.12%** on CodeVision, validating its effectiveness for multimodal program repair.
💡 Research Summary
SVRepair introduces a novel multimodal framework for Automated Program Repair (APR) that explicitly leverages visual artifacts such as screenshots, UI mock‑ups, and control‑flow graphs. The core insight is that many bug reports contain crucial diagnostic information in visual form, yet existing LLM‑based APR systems treat inputs as pure text, leading to loss of context and noisy reasoning. To close this semantic gap, the authors propose two tightly coupled components: a Structured Visual Representation (SVR) model and a coding agent that iteratively refines patches.
The SVR model is a fine‑tuned vision‑language transformer that converts heterogeneous images into a unified intermediate representation called a Semantic Scene Graph (SSG). An SSG is a directed graph whose nodes correspond to GUI elements (buttons, input fields) or code blocks (basic blocks, functions) and whose edges encode relation types such as hierarchical containment, control flow, or data flow. By serializing the graph in Mermaid syntax, the representation becomes a pure textual artifact that downstream large language models (LLMs) can ingest without additional parsing machinery. Training data are assembled from two major sources: (1) the WebSight dataset, which provides 2 million HTML‑screenshot pairs whose DOM trees are parsed into nodes and hierarchical edges, and (2) a curated collection of 37 highly rated GitHub repositories from which control‑flow graphs are extracted using static analysis tools. The SVR is trained with a standard autoregressive loss to predict the SSG token sequence given the image, achieving high fidelity in image‑to‑graph translation.
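The paper does not publish the SSG schema, but the description above (typed nodes, relational edges, Mermaid serialization) can be sketched in a few lines of Python. All class and field names below are hypothetical illustrations, not the authors' actual data structures:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SSGNode:
    node_id: str   # unique identifier referenced by edges
    kind: str      # e.g. "button", "input", "basic_block"
    label: str     # human-readable text from the screenshot or code

@dataclass
class SSGEdge:
    src: str
    dst: str
    relation: str  # e.g. "contains", "control_flow", "data_flow"

@dataclass
class SemanticSceneGraph:
    nodes: List[SSGNode] = field(default_factory=list)
    edges: List[SSGEdge] = field(default_factory=list)

    def to_mermaid(self) -> str:
        # Serialize as Mermaid text so a text-only LLM can ingest the graph
        # without any extra parsing machinery.
        lines = ["graph TD"]
        for n in self.nodes:
            lines.append(f'    {n.node_id}["{n.kind}: {n.label}"]')
        for e in self.edges:
            lines.append(f"    {e.src} -->|{e.relation}| {e.dst}")
        return "\n".join(lines)
```

For a login form containing a submit button, `to_mermaid()` would emit lines such as `form1 -->|contains| btn1`, which can be pasted directly into an LLM prompt.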
The SVRepair agent operates inside an isolated Docker environment that mirrors a real development setup. It is equipped with a toolbox of utilities: grep/glob for code search, read/write/edit for file manipulation, and a bash executor for building and testing. The agent receives the SSG together with the full codebase, then follows a three‑stage loop: (1) Localization – it searches for symbols, error messages, or UI element names extracted from the graph to narrow down candidate files; (2) Generation – it prompts a coding LLM (e.g., GPT‑4‑Turbo) with the localized code context and the SSG description, asking for a concrete patch; (3) Validation – the patch is applied and the test suite is run. If the environment lacks a test harness, the agent automatically synthesizes a minimal “manual” test script to verify the logical correctness of the change.
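The three-stage loop can be sketched as a small driver. This is a simplified, hypothetical skeleton (the real agent runs grep/edit/bash tools inside Docker): the codebase is modeled as a path-to-source dict, and the LLM call and test suite are injected as callables:

```python
from typing import Callable, Dict, Iterable, List, Optional

Codebase = Dict[str, str]  # file path -> source text

def localize(symbols: Iterable[str], codebase: Codebase) -> List[str]:
    """Stage 1: grep-style search for SSG symbols to find candidate files."""
    return [path for path, src in codebase.items()
            if any(sym in src for sym in symbols)]

def repair_loop(
    symbols: Iterable[str],
    codebase: Codebase,
    generate_patch: Callable[[List[str], Codebase], Codebase],  # stands in for the LLM call
    run_tests: Callable[[Codebase], bool],                      # stands in for the test harness
    max_iters: int = 3,
) -> Optional[Codebase]:
    for _ in range(max_iters):
        candidates = localize(symbols, codebase)      # Stage 1: localization
        patch = generate_patch(candidates, codebase)  # Stage 2: generation
        patched = {**codebase, **patch}
        if run_tests(patched):                        # Stage 3: validation
            return patched
        # On failure, the real agent re-segments the visual artifact
        # (see the segmentation module below) before retrying.
        codebase = patched
    return None
```

Injecting `generate_patch` and `run_tests` keeps the control flow testable without a live model or container, mirroring how the paper's agent separates tool use from decision making.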
A key innovation is the iterative visual‑artifact segmentation module. Empirical analysis shows that as the number of visual elements grows, the success rate of patches drops sharply, indicating that irrelevant UI components introduce noise. After a failed validation, the agent extracts error logs and uses them to identify the most bug‑relevant region of the original screenshot. This sub‑artifact is cropped and fed back into the SVR, producing a more focused SSG for the next iteration. The process repeats until a patch passes all tests or a predefined iteration limit is reached. This feedback loop dramatically reduces hallucinations in the LLM and improves fault‑localization precision.
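The cropping step can be approximated in pure Python if we assume the SSG supplies a bounding box for each named UI element (the paper does not specify this interface; the function below is an illustrative assumption). Elements mentioned in the error log determine the bug-relevant region:

```python
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

def bug_region(elements: Dict[str, Box], error_log: str, pad: int = 10) -> Optional[Box]:
    """Union bounding box of UI elements whose names appear in the error log.

    The returned box is what the agent would crop from the screenshot and
    feed back into the SVR to produce a more focused scene graph.
    """
    hits = [box for name, box in elements.items() if name in error_log]
    if not hits:
        return None  # no element implicated; keep the full artifact
    x0 = max(min(b[0] for b in hits) - pad, 0)
    y0 = max(min(b[1] for b in hits) - pad, 0)
    x1 = max(b[2] for b in hits) + pad
    y1 = max(b[3] for b in hits) + pad
    return (x0, y0, x1, y1)
```

Returning `None` when no element matches lets the loop fall back to the full screenshot rather than cropping arbitrarily, which would otherwise discard the only available evidence.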
Experimental evaluation spans three benchmarks: SWE‑Bench M (function‑level bugs with textual issue descriptions), MMCode (multimodal code generation from UI screenshots), and CodeVision (visual‑code alignment tasks). SVRepair achieves 36.47% accuracy on SWE‑Bench M, 38.02% on MMCode, and an impressive 95.12% on CodeVision, outperforming prior state‑of‑the‑art unimodal and multimodal APR approaches by 7–12 percentage points. The high performance on CodeVision particularly validates the effectiveness of the SSG as a lossless bridge between visual and code domains.
The paper also discusses limitations. The creation of large‑scale (image, SSG) pairs still requires manual annotation or heuristic parsing, which may not scale to domains beyond web UI and static control‑flow graphs (e.g., animated dashboards, scientific plots). Moreover, dynamic UI states that change at runtime are not captured by a single static graph, potentially limiting applicability to highly interactive applications. Future work could explore video‑to‑graph sequences, self‑supervised graph generation, or integration with reinforcement‑learning agents that learn to request more visual context on demand.
In summary, SVRepair demonstrates that structuring visual information into a graph and feeding it to a code‑focused LLM yields a powerful multimodal APR system. By unifying visual and textual cues, providing an iterative refinement mechanism, and grounding patches in executable test feedback, the approach sets a new benchmark for leveraging UI‑centric bug reports in automated software maintenance.