Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization
Large language models and AI agents have recently shown promise in automating software performance optimization, but existing approaches predominantly rely on local, syntax-driven code transformations. This limits their ability to reason about program behavior and capture whole-system performance interactions. As modern software increasingly comprises interacting components - such as microservices, databases, and shared infrastructure - effective code optimization requires reasoning about program structure and system architecture beyond individual functions or files. This paper explores the feasibility of whole-system optimization for microservices. We introduce a multi-agent framework that integrates control-flow and data-flow representations with architectural and cross-component dependency signals to support system-level performance reasoning. The proposed system is decomposed into coordinated agent roles - summarization, analysis, optimization, and verification - that collaboratively identify cross-cutting bottlenecks and construct multi-step optimization strategies spanning the software stack. We present a proof-of-concept on a microservice-based system that illustrates the effectiveness of our proposed framework, achieving a 36.58% improvement in throughput and a 27.81% reduction in average response time.
💡 Research Summary
The paper addresses a critical gap in current large‑language‑model (LLM) driven code optimization: most existing approaches operate on a local, syntax‑centric level, optimizing individual functions or classes without considering the broader system context. Modern applications, especially those built as microservice architectures, exhibit performance characteristics that emerge from interactions among services, databases, and shared infrastructure. To tackle this, the authors propose a multi‑agent framework that elevates performance reasoning to the whole‑system level by integrating static program analysis with architectural information.
The framework consists of four coordinated agents that form a pipeline: Summarization, Analysis, Optimization, and Performance Evaluation. The Summarization stage is split into three specialized agents: a Component Summarization Agent, a Behavior Summarization Agent, and an Environment Summarization Agent. Using CodeQL as the static analysis engine, the Component agent extracts a hierarchical view of services, packages, classes, and methods, along with static dependencies such as call‑based, type‑based, and resource‑based couplings. The Behavior agent builds inter‑procedural call graphs, captures control‑flow complexity, identifies database access patterns, and detects synchronization constructs. The Environment agent records build‑time and deployment‑time settings (compiler flags, runtime parameters, external libraries) to make non‑code constraints explicit.
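Since the summaries are serialized into a language-agnostic JSON format for downstream agents, it helps to picture what such a record might look like. The sketch below is a minimal, hypothetical shape for one service's combined summary; all field names and values are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical summary record for one microservice. Field names are
# illustrative assumptions; the paper does not publish its exact schema.
component_summary = {
    "service": "recommender",
    "classes": [
        {
            "name": "RecommenderEndpoint",
            "methods": ["getRecommendations"],
            "dependencies": [
                {"kind": "call", "target": "persistence.ProductRepository"},
                {"kind": "resource", "target": "jdbc:mysql://teastore-db"},
            ],
        }
    ],
    "behavior": {
        "call_graph_depth": 4,
        "db_access_patterns": ["SELECT-in-loop"],
        "synchronization": ["synchronized block in cache lookup"],
    },
    "environment": {"jvm_flags": ["-Xmx512m"], "libraries": ["hibernate"]},
}

# Serialize to the language-agnostic JSON passed to the Analysis Agent.
serialized = json.dumps(component_summary, indent=2)
```

Keeping the three summary facets (component, behavior, environment) in one record mirrors how the three summarization agents feed a single downstream consumer.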
The Analysis Agent consumes the structured summaries and performs a three‑step reasoning process. First, it extracts performance signals (e.g., high‑frequency calls, deep call stacks, inefficient query patterns). Second, it maps these signals to concrete code locations. Third, it ranks optimization opportunities by estimated impact and confidence, producing a detailed report with source file references, risk assessments, and expected gains.
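The third step, ranking by estimated impact and confidence, can be sketched as a simple impact-weighted score. The scoring formula, the example locations, and the numeric values below are assumptions for illustration; the paper does not specify its ranking function.

```python
from dataclasses import dataclass

@dataclass
class Opportunity:
    location: str        # source file reference the signal maps to
    signal: str          # performance signal (e.g., "SELECT-in-loop")
    est_impact: float    # estimated relative gain, 0..1 (assumed scale)
    confidence: float    # analysis confidence, 0..1 (assumed scale)

    @property
    def priority(self) -> float:
        # Assumed scoring rule: weight impact by confidence.
        return self.est_impact * self.confidence

# Illustrative opportunities; all values are made up.
opportunities = [
    Opportunity("AuthService.java:88", "deep call stack", 0.10, 0.9),
    Opportunity("OrderDAO.java:142", "SELECT-in-loop", 0.40, 0.8),
    Opportunity("CacheManager.java:31", "coarse lock", 0.25, 0.5),
]

ranked = sorted(opportunities, key=lambda o: o.priority, reverse=True)
```

Under this scoring, a moderately confident high-impact finding (the query pattern) outranks a highly confident low-impact one (the deep call stack), which matches the intent of prioritizing expected gains.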
Guided by this report, the Optimization Agent leverages an LLM (GPT‑5.2) to generate concrete, verifiable code and configuration changes. Crucially, the agent enforces “non‑breaking” constraints: public APIs and service interfaces must remain unchanged, and any modification must pass the existing test suite. Each generated patch is accompanied by a justification that links the change to the identified performance issue.
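The "non-breaking" constraint amounts to a gate that rejects any patch before benchmarking it. A minimal sketch, assuming public-API signatures can be compared as sets and the test suite exposed as a pass/fail callable (both simplifications of what the framework actually checks):

```python
def patch_is_safe(api_before, api_after, run_tests):
    """Accept a patch only if the public API is unchanged and the
    existing test suite still passes (the 'non-breaking' constraint).
    api_before/api_after: sets of public method signatures (assumed
    representation); run_tests: callable returning True iff all pass."""
    if api_before != api_after:   # any changed, added, or removed signature
        return False
    return run_tests()

# Illustrative signatures; names are made up, not from TeaStore.
api = {"OrderService.placeOrder(Cart)", "OrderService.cancel(long)"}

# A refactoring that preserves the interface passes through to testing:
safe = patch_is_safe(api, set(api), run_tests=lambda: True)

# A patch that drops a public method is rejected before tests even run:
unsafe = patch_is_safe(api, {"OrderService.placeOrder(Cart)"},
                       run_tests=lambda: True)
```

Checking the interface first is cheap; running the JUnit suite only for interface-preserving patches keeps the verification loop inexpensive.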
The Performance Evaluation Agent validates both functional correctness and performance impact. Functional validation uses the application’s JUnit test suite; any failing patch is discarded. For performance validation, the agent runs dynamic profiling under realistic workloads using Apache JMeter, measuring latency, throughput, CPU, and memory usage. Results are fed back to the Optimization Agent, enabling an iterative refinement loop that stops when no further meaningful gains are observed.
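The refinement loop's stopping rule ("no further meaningful gains") can be sketched as a relative-gain threshold. The threshold, round cap, and simulated throughput readings below are assumptions; in the paper the measurement step is a JMeter benchmark, not a stub.

```python
def refine(apply_patch, measure_throughput, min_gain=0.01, max_rounds=5):
    """Iterate optimize -> evaluate until the relative throughput gain
    of a round falls below min_gain. measure_throughput() stands in for
    a benchmark run (JMeter in the paper); min_gain and max_rounds are
    assumed knobs, not values from the paper."""
    baseline = measure_throughput()
    for round_no in range(max_rounds):
        apply_patch(round_no)          # Optimization Agent emits a patch
        current = measure_throughput() # re-benchmark the patched system
        gain = (current - baseline) / baseline
        if gain < min_gain:            # gain no longer meaningful: stop
            break
        baseline = current             # accept the patch, keep iterating
    return baseline

# Simulated readings: big gain, small gain, then a negligible one.
readings = iter([100.0, 120.0, 125.0, 125.5])
final = refine(lambda r: None, lambda: next(readings))  # final == 125.0
```

The last reading (125.5) is discarded because its 0.4% gain falls below the 1% threshold, so the loop settles on the previous accepted state.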
Implementation details: the prototype is built in Python and orchestrated with LangGraph, which models the agents as nodes in a directed graph, with LangSmith used for tracing model outputs and intermediate states. CodeQL queries are customized for Java to extract the required architectural and behavioral artifacts, which are serialized into a language‑agnostic JSON format for downstream processing.
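The agents-as-graph-nodes idea can be illustrated without the library itself. Below is a plain-Python stand-in for the LangGraph orchestration, where each node transforms a shared state dict and a trace list plays the role of LangSmith's intermediate-state tracing; node names and state keys are illustrative, not the prototype's actual ones.

```python
# Plain-Python stand-in for the LangGraph pipeline: each agent is a node
# transforming a shared state dict; the list order fixes the edges.
# Node names and state keys are illustrative assumptions.
PIPELINE = [
    ("summarize", lambda s: {**s, "summaries": f"summaries({s['repo']})"}),
    ("analyze",   lambda s: {**s, "report": "ranked bottlenecks"}),
    ("optimize",  lambda s: {**s, "patches": ["patch-1"]}),
    ("evaluate",  lambda s: {**s, "metrics": {"throughput_gain": 0.37}}),
]

def run_pipeline(state):
    for name, node in PIPELINE:
        state = node(state)
        # Record each step, loosely analogous to LangSmith tracing.
        state.setdefault("trace", []).append(name)
    return state

result = run_pipeline({"repo": "TeaStore"})
```

Modeling the pipeline as data (an ordered node list) rather than hard-coded calls makes it straightforward to reorder stages or splice the evaluation feedback loop back in, which echoes why a graph framework was chosen.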
The authors evaluate the framework on TeaStore, a Java‑based microservice benchmark consisting of six inter‑communicating services that model an online retail system. Experiments run on a dedicated Xeon W‑2295 server, using GPT‑5.2 with a temperature of 0.7 for all agents. After the full optimization cycle, the system exhibits a 36.58% increase in throughput and a 27.81% reduction in average response time, while all functional tests continue to pass. Resource utilization also improves modestly (≈12% CPU reduction).
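To make the reported deltas concrete: the paper gives only percentages, so the absolute before/after numbers below are hypothetical, chosen to reproduce the stated figures.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; positive = increase."""
    return (after - before) / before * 100

# Hypothetical raw measurements chosen to match the reported percentages.
throughput_gain = pct_change(before=100.0, after=136.58)  # +36.58% throughput
latency_drop = pct_change(before=100.0, after=72.19)      # -27.81% avg response time
```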
Contributions of the paper are threefold: (1) a novel system‑level summarization model that fuses control‑flow, data‑flow, and architectural dependencies; (2) a coordinated multi‑agent pipeline that systematically transforms high‑level performance insights into verified code changes; (3) a proof‑of‑concept demonstration showing substantial performance gains on a realistic microservice application.
The paper acknowledges several limitations. The reliance on static analysis means dynamic runtime behaviors (e.g., JIT optimizations, garbage collection pauses) are not fully captured. The current prototype separates the evaluation stage from the main pipeline, leaving full end‑to‑end automation as future work. Moreover, the quality of generated optimizations heavily depends on prompt engineering and domain knowledge embedded in the LLM. Future directions include integrating dynamic tracing and machine‑learning‑based cost models to improve impact estimation, extending the framework to support continuous integration/continuous deployment (CI/CD) environments for real‑time optimization, and exploring richer multimodal agents (e.g., incorporating profiling data, logs, and trace spans) to close the gap between static predictions and observed performance.