COLT: Lightweight Multi-LLM Collaboration through Shared MCTS Reasoning for Model Compilation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Model serving costs dominate AI systems, making compiler optimization essential for scalable deployment. Recent works show that a large language model (LLM) can guide compiler search by reasoning over program structure and optimization history. However, using a single large model throughout the search is expensive, while smaller models are less reliable when used alone. This paper therefore asks whether multi-LLM collaborative reasoning that relies primarily on small LLMs can match or exceed the performance of a single large model. To that end, we propose a lightweight collaborative multi-LLM framework, dubbed COLT, for compiler optimization that enables coordinated reasoning across multiple models within a single Monte Carlo tree search (MCTS) process. A key contribution is the use of a single shared MCTS tree as the collaboration substrate across LLMs, enabling the reuse of transformation prefixes and cross-model value propagation. By endogenizing model selection within the lightweight MCTS optimization loop, we circumvent both heavy internal reasoning mechanisms and the conventional agentic machinery of external planners, multiple concurrent LLMs, databases, external memory/versioning of intermediate results, and controllers. At every iteration, the acting LLM proposes a joint action: (compiler transformation, model to be queried next). We also introduce a model-aware tree policy that biases search toward smaller models while preserving exploration, and a course-alteration mechanism that escalates to the largest model when the search exhibits persistent regressions attributable to smaller models.


💡 Research Summary

The paper tackles the growing cost of model serving by improving compiler optimization, a critical step for efficient deployment of neural workloads. Recent works have shown that large language models (LLMs) can guide compiler searches through contextual reasoning over program structure and optimization history, but relying on a single large LLM is expensive, while smaller models alone are unreliable. The authors ask whether a collaborative approach that primarily uses small LLMs can match or surpass a single large model without the heavyweight machinery of agentic systems.

To answer this, they introduce COLT (Collaborative LLM reasoning via shared Tree), a lightweight framework that integrates multiple LLMs into a single Monte‑Carlo Tree Search (MCTS) process. The key innovation is treating model selection as a first‑class decision within the MCTS. Each node in the tree represents a joint state ⟨program, current model⟩. When a node is selected, the associated LLM proposes a joint action ⟨transformation, next model⟩, thereby expanding the tree along both the transformation and model dimensions. This “endogenous model selection” embeds the routing of LLMs directly into the long‑horizon optimization objective.
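The joint-state design described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the node fields, the string-based program encoding, and the reward convention are all assumptions made for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """MCTS node over the joint state <program, current model> (sketch; names hypothetical)."""
    program: str                       # identifier for the transformation prefix applied so far
    model: str                         # LLM responsible for proposing the next joint action
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0                 # running mean of observed rewards (e.g. speedups)

    def expand(self, transformation: str, next_model: str) -> "Node":
        # The acting LLM proposes a joint action <transformation, next model>,
        # growing the shared tree along both dimensions at once.
        key = (transformation, next_model)
        if key not in self.children:
            child_program = f"{self.program};{transformation}"
            self.children[key] = Node(child_program, next_model, parent=self)
        return self.children[key]

def backpropagate(node: Node, reward: float) -> None:
    # Values propagate through the single shared tree, so a speedup found
    # by one model informs the selection decisions of every other model.
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent
```

Because every model reads and writes the same tree, a transformation prefix explored under a small model is immediately reusable when a larger model takes over at a descendant node.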

To bias the search toward cheaper models while preserving exploration, the authors design a model‑aware tree policy that modifies the classic UCT formula with a model‑size weight, favoring smaller LLMs. They also add a “course‑alteration” mechanism that escalates to the largest LLM when persistent regressions are detected, ensuring robustness. Because all models share the same tree, transformation prefixes are reused, and value estimates from downstream program variants are back‑propagated through the shared structure. Consequently, knowledge discovered by one model informs the decisions of others, a crucial advantage for compiler optimization where transformation sequences have compounding and long‑range effects.
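The summary does not give the exact formulas, but the two mechanisms can be sketched as follows. This assumes the model-size weight scales the UCT exploration bonus (so cheaper models receive a larger bonus) and that a "persistent regression" means a window of consecutive rollouts below a baseline speedup of 1.0; both are illustrative choices, and the weight values are invented.

```python
import math

# Hypothetical per-model weights: smaller models get a larger exploration bonus.
MODEL_WEIGHT = {"small": 1.0, "medium": 0.6, "large": 0.3}

def model_aware_uct(mean_value: float, child_visits: int,
                    parent_visits: int, next_model: str,
                    c: float = 1.4) -> float:
    """Classic UCT score with a model-size weight on the exploration term,
    biasing selection toward joint actions that query cheaper models."""
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return mean_value + MODEL_WEIGHT[next_model] * explore

def should_escalate(recent_rewards: list[float], window: int = 4) -> bool:
    # Course alteration: hand control to the largest model when the last
    # `window` rollouts all regressed (speedup below the 1.0 baseline).
    tail = recent_rewards[-window:]
    return len(tail) == window and all(r < 1.0 for r in tail)
```

Under this sketch, two children with identical value estimates are ranked in favor of the one whose next query goes to a smaller model, while the escalation check acts as a safety valve against compounding mistakes by those models.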

Experiments on five modern neural network benchmarks (including Transformer and ResNet variants) run on both CPU and GPU platforms demonstrate that COLT consistently outperforms a single‑large‑LLM baseline. Averaged across benchmarks, COLT achieves a 10.86× speedup on CPU and 30.05× on GPU for its best configuration, while invoking the largest LLM in only 23.9% of total calls. This shows that a majority of the optimization can be driven by inexpensive models without sacrificing quality.

The paper acknowledges limitations: early reliance on small models may propagate sub‑optimal transformations, the textual prompt interface can become costly as the candidate set grows, and the current approach does not dynamically adjust model trustworthiness. Future directions include encoding transformation candidates as structured graphs, employing meta‑learning to adapt model selection policies, and integrating more sophisticated cost‑aware scheduling. Overall, COLT presents a compelling, low‑overhead method for multi‑LLM collaboration in compiler optimization, offering substantial cost savings and performance gains.

