AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve, an adaptive LLM-selection mechanism for multi-LLM evolutionary sequential refinement that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
💡 Research Summary
The paper tackles the high inference cost that plagues evolutionary AI agents, especially those that repeatedly call large language models (LLMs) to iteratively generate and refine code. Existing model‑cascade or speculative decoding approaches rely on static heuristics or external controllers and do not exploit the model’s own uncertainty signals. AdaptEvolve introduces a lightweight, uncertainty‑driven routing mechanism that decides, at each mutation step, whether to keep using a small, cheap model (MS, e.g., 4 B parameters) or to “escalate” to a larger, more capable model (ML, e.g., 32 B parameters).
The core idea is to compute four token‑level confidence metrics from the small model’s output: Mean Confidence (MC), Lowest Group Confidence (LGC), Tail Confidence (TC), and Bottom‑K% Confidence (BWC). These metrics capture global, local, and end‑of‑sequence uncertainty, providing a rich feature vector C(x) for each candidate solution x. A binary classifier Φ(C) ∈ {0, 1} predicts solvability: Φ = 0 keeps MS, Φ = 1 switches to ML.
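The four metrics can be sketched directly from the small model's per-token log-probabilities. The helper below is an illustrative reconstruction, not the paper's code: the window size, tail length, and bottom fraction are hypothetical hyperparameters chosen for readability.

```python
import math

def confidence_features(token_logprobs, group_size=8, tail_len=16, bottom_frac=0.1):
    """Sketch of the four token-level confidence metrics.

    `token_logprobs` is the list of log-probabilities the small model
    assigned to each generated token. All hyperparameters here
    (group size, tail length, bottom fraction) are illustrative guesses.
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    n = len(probs)

    # Mean Confidence (MC): global average token probability.
    mc = sum(probs) / n

    # Lowest Group Confidence (LGC): minimum over group means,
    # capturing the least confident local region.
    groups = [probs[i:i + group_size] for i in range(0, n, group_size)]
    lgc = min(sum(g) / len(g) for g in groups)

    # Tail Confidence (TC): mean confidence over the final tokens.
    tail = probs[-tail_len:]
    tc = sum(tail) / len(tail)

    # Bottom-K% Confidence (BWC): mean of the K% lowest-confidence tokens.
    k = max(1, int(n * bottom_frac))
    bwc = sum(sorted(probs)[:k]) / k

    return {"MC": mc, "LGC": lgc, "TC": tc, "BWC": bwc}
```

Together these four numbers form the feature vector C(x): MC summarizes the whole sequence, LGC and BWC expose localized drops in confidence, and TC targets end-of-sequence uncertainty.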
Training proceeds in two stages. First, a minimal warm‑up phase (N = 50 samples) collects labeled pairs (C(x), y), where y indicates whether the large model would solve the instance. A shallow decision tree (Gini impurity, max depth = 5) is fit to these data, yielding an initial routing rule that captures non‑linear interactions such as low MC but high LGC. Second, during the evolutionary search, a Hoeffding Adaptive Tree (HAT) continuously updates split criteria and prunes branches when concept drift is detected, ensuring the router adapts to the evolving difficulty distribution of the population.
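The Gini-impurity split at the heart of the warm-up tree is easy to make concrete. The sketch below finds a single optimal split over the confidence features; the paper's tree grows such splits recursively to depth 5, and the names here are illustrative, not the authors' API.

```python
def gini(labels):
    # Gini impurity of a binary (0/1) label list.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(X, y):
    """Find the (feature, threshold) pair minimizing weighted Gini impurity.

    X: list of feature vectors C(x), e.g. [MC, LGC, TC, BWC]; y: 0/1 labels
    indicating whether escalation to the large model was needed. A real
    warm-up phase would apply this recursively; one split is shown for brevity.
    """
    best_feature, best_threshold, best_score = None, None, float("inf")
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue  # degenerate split: all samples on one side
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_feature, best_threshold, best_score = f, threshold, score
    return best_feature, best_threshold, best_score
```

The online stage replaces this batch-fit tree with a Hoeffding Adaptive Tree, which revises the same kind of split thresholds incrementally as the population's difficulty distribution drifts.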
The mutation operator is thus redefined as:
x′ = MS(x) if Φ(C(x)) = 0, otherwise x′ = ML(x).
This creates a dynamic computation graph where cheap generations handle routine steps, and expensive generations are invoked only for high‑entropy, hard‑to‑solve steps.
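The gated mutation step above can be sketched in a few lines. This is a schematic with placeholder callables, not the paper's interface; in particular, how exactly C(x) is obtained from the small model's output is an assumption here.

```python
def mutate(x, small_model, large_model, features, router):
    """Confidence-gated mutation operator (sketch).

    `small_model` / `large_model` stand in for MS and ML, `features`
    computes the confidence vector C(.) from a generation, and `router`
    is the classifier Phi mapping C to {0, 1}. All names are placeholders.
    """
    draft = small_model(x)            # cheap draft from MS
    if router(features(draft)) == 0:  # Phi = 0: MS deemed sufficient
        return draft
    return large_model(x)             # Phi = 1: escalate to ML
```

Because the router is a shallow tree over four scalar features, the routing decision itself costs effectively nothing relative to a single LLM call.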
Experiments were conducted on two code‑generation benchmarks: LiveCodeBench v5 (880 samples) and MBPP (974 samples). Cost was normalized so that one call to the 32 B model equals 1 unit and one call to the 4 B model equals 0.125 units. The authors compare AdaptEvolve against several baselines: pure small model, pure large model, random routing, static decision‑tree routing, and a classic cascade baseline (Chen et al., 2023).
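Under this normalization, the cost accounting is simple arithmetic. The sketch below shows the mechanics with illustrative call counts; the paper's reported savings reflect its actual routing mix, not these example numbers.

```python
COST_LARGE = 1.0    # one call to the 32B model (paper's normalization)
COST_SMALL = 0.125  # one call to the 4B model

def total_cost(n_small, n_large):
    """Normalized inference cost of a run with mixed model calls."""
    return n_small * COST_SMALL + n_large * COST_LARGE

def savings_vs_all_large(n_small, n_large):
    """Fraction of compute saved relative to routing every call to the 32B model."""
    baseline = (n_small + n_large) * COST_LARGE
    return 1.0 - total_cost(n_small, n_large) / baseline
```

For example, eight small-model calls cost exactly as much as one large-model call, so every step the router keeps on the small model recovers 87.5% of that step's cost.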
Key results:
- On LiveCodeBench, AdaptEvolve achieved 73.6 % accuracy (97.9 % of the 32 B upper bound) while using a 42 : 58 small‑to‑large ratio, reducing total compute by 34.4 % and attaining an efficiency score of 35.4 (vs. 23.7 for the pure large model).
- On MBPP, the router identified that 85 % of queries could be solved by the small model, cutting compute by 41.5 % and preserving 97.1 % of peak accuracy, yielding an efficiency of 132.3 (nearly double the pure large model’s 79.7).
- The Hoeffding Adaptive Tree consistently outperformed the static decision tree, improving accuracy by 2.4 points on LiveCodeBench by adjusting thresholds as the population evolved toward harder edge cases.
- Overall, across benchmarks, AdaptEvolve reduced inference cost by an average of 37.9 % while retaining 97.5 % of the upper‑bound accuracy, establishing a superior Pareto frontier.
Limitations are acknowledged: the method requires access to token‑level log‑probabilities, which are unavailable in many closed‑source APIs, restricting deployment to open‑weight models or services that expose log‑probs. The confidence metrics were tuned for code generation; extending to other domains (e.g., summarization, QA) may need metric redesign. Moreover, overly aggressive escalation to the large model could erode cost savings, suggesting future work on multi‑objective optimization of routing policies.
In conclusion, AdaptEvolve demonstrates that intrinsic generation uncertainty is a reliable, low‑overhead signal for dynamic model selection in evolutionary agentic systems. By coupling uncertainty‑driven routing with lightweight online learning, the framework achieves substantial computational savings without sacrificing solution quality, offering a practical pathway toward scalable, cost‑effective AI agents that can judiciously balance the trade‑off between model capability and inference efficiency.