Evolutionary Context Search for Automated Skill Acquisition
Large Language Models cannot reliably acquire new knowledge post-deployment – even when relevant text resources exist, models fail to transform them into actionable knowledge without retraining. Retrieval-Augmented Generation attempts to bridge this gap by surfacing relevant documents at inference time, yet similarity-based retrieval often fails to identify context that actually improves task performance. We introduce Evolutionary Context Search (ECS), an evolutionary method that searches over context combinations using accuracy on a small development set, requiring only inference calls and no weight updates. ECS moves beyond semantic similarity to discover non-obvious context pairings that significantly boost performance. Our empirical results show that ECS improves performance on BackendBench by 27% and on $\tau$-bench airline by 7%. The evolved contexts are model-agnostic: contexts evolved with Gemini-3-Flash transfer effectively to Claude Sonnet and DeepSeek. This suggests that ECS opens a path toward automated context discovery for skill acquisition – an efficient alternative to manual prompt engineering or costly fine-tuning.
💡 Research Summary
The paper tackles the problem of updating large language models (LLMs) with new knowledge after deployment, a task that is difficult because traditional fine‑tuning or reinforcement‑learning approaches require access to model weights and are computationally expensive. Retrieval‑augmented generation (RAG) offers a weight‑free alternative by pulling in external documents at inference time, but it relies on semantic similarity between the query and the retrieved passages. This reliance often leads to sub‑optimal context selection: verbose or irrelevant documents are retrieved, ordering matters, and the retrieved text may not actually improve downstream performance.
To address these shortcomings, the authors propose Evolutionary Context Search (ECS), an algorithm that treats the selection of external context as an optimization problem. Instead of using similarity scores, ECS directly measures the utility of a candidate context by running the target LLM on a small development set and observing task performance (accuracy, success rate, etc.). The method proceeds as follows:
- Context Unit Construction – The raw document collection D is processed into a set of “context units” U. Units come in three flavors: (a) raw source snippets (e.g., full DSL code files), (b) distilled insights (human‑readable rules extracted from error trajectories), and (c) reusable skills (structured procedural modules in an agent‑skill format). This multi‑granular representation allows the algorithm to combine fine‑grained and abstract knowledge as needed.
- Population Initialization – A population P₀ of N candidate contexts is sampled from U, each context being a concatenation of a fixed number of units so that token limits are respected. Diversity is ensured by sampling without replacement.
- Fitness Evaluation – For each candidate C, the LLM M (e.g., Gemini‑3‑Flash) is prompted with C as part of the input and evaluated on the development tasks T. The resulting performance metric is normalized to a fitness score s(C) ∈ [0, 1].
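The steps above can be sketched as a simple evolutionary loop. This is a minimal illustration, not the paper's implementation: `call_llm` (which returns whether the model solved a task given a candidate context), `dev_tasks`, and all hyperparameters are hypothetical stand-ins, and context units are plain strings.

```python
import random

def fitness(context_units, dev_tasks, call_llm):
    """Fraction of development tasks solved with the candidate context prepended."""
    context = "\n\n".join(context_units)
    solved = sum(bool(call_llm(context, task)) for task in dev_tasks)
    return solved / len(dev_tasks)

def evolve_contexts(units, dev_tasks, call_llm,
                    pop_size=8, context_len=3, generations=5, seed=0):
    rng = random.Random(seed)
    # Population initialization: each candidate concatenates a fixed number
    # of units, sampled without replacement for diversity.
    population = [rng.sample(units, context_len) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness evaluation and selection: keep the top half.
        scored = sorted(population,
                        key=lambda c: fitness(c, dev_tasks, call_llm),
                        reverse=True)
        parents = scored[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            # Crossover: merge two parents, deduplicate, truncate to length.
            a, b = rng.sample(parents, 2)
            child = list(dict.fromkeys(a[: context_len // 2 + 1] + b))[:context_len]
            # Mutation: occasionally swap in a fresh unit from the pool.
            if rng.random() < 0.3:
                child[rng.randrange(context_len)] = rng.choice(units)
            children.append(child)
        population = parents + children
    return max(population, key=lambda c: fitness(c, dev_tasks, call_llm))
```

Note that fitness is measured purely through inference calls on the development set, so no model weights are touched and the same loop works with any black-box LLM endpoint.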