Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs
Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We introduce SEAM (Structured Experience Adapter Module), a lightweight, executor-specific plug-in that stores experience in its parameters and generates a structured, instance-tailored experience entry in a single forward pass to guide a frozen LLM executor. SEAM is trained for utility via executor rollouts and GRPO while keeping the executor frozen, and it can be further improved after deployment with supervised fine-tuning on logged successful trajectories. Experiments on mathematical reasoning benchmarks show consistent accuracy gains across executors with low overhead. Extensive ablations and analyses further elucidate the mechanisms underlying SEAM’s effectiveness and robustness.
💡 Research Summary
The paper tackles a fundamental limitation of large language models (LLMs): once deployed they operate statically, solving each new problem from scratch without reusing useful knowledge from past interactions. Existing “experience‑reuse” methods rely on retrieval‑augmented generation (RAG). In a typical RAG pipeline, past trajectories are summarized into structured snippets, stored in an external memory bank, and at inference time a similarity‑based retriever fetches a few candidates which are then possibly re‑written before being appended to the prompt. This approach suffers from three major drawbacks. First, similarity does not guarantee utility; retrieved snippets can be noisy, miss critical constraints, or even destabilize reasoning. Second, the retrieval, summarization and extra LLM calls introduce non‑trivial latency and computational overhead. Third, maintaining an external memory (indexing, deduplication, schema design) requires continual engineering effort.
To overcome these issues, the authors propose SEAM (Structured Experience Adapter Module), a lightweight, executor‑specific plug‑in that internalizes a “structured experience library” directly in its parameters. For a given downstream executor LLM Eϕ (e.g., a math‑reasoning model, a code‑generation model, etc.) and an input problem s, SEAM Aθ generates, in a single forward pass, a short, schema‑constrained experience prompt z. The schema consists of three components: (1) Problem analysis – a concise diagnosis of the instance, highlighting difficulty and likely failure modes; (2) Experience highlights – distilled, executor‑aligned heuristics or checks that have helped in prior rollouts; (3) Reference plan – a step‑by‑step procedural outline that demonstrates a reliable solving workflow without revealing the final answer. The generated z is then concatenated to the original prompt and fed to the frozen executor Eϕ, which solves the task unchanged.
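The three-part schema above can be sketched as a small data structure. This is a hypothetical illustration, not the paper's actual implementation: the class and field names (`ExperienceEntry`, `problem_analysis`, etc.) are invented here, and the structural-completeness check stands in for whatever schema validator the reward uses.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-part experience schema.
# Field names and rendering format are illustrative assumptions.

@dataclass
class ExperienceEntry:
    problem_analysis: str       # (1) diagnosis: difficulty, likely failure modes
    experience_highlights: str  # (2) distilled heuristics/checks from prior rollouts
    reference_plan: str         # (3) step-by-step workflow; must not reveal the answer

    def is_structurally_complete(self) -> bool:
        """Schema check used by the binary reward: all three parts non-empty."""
        parts = (self.problem_analysis, self.experience_highlights, self.reference_plan)
        return all(p.strip() for p in parts)

    def render(self) -> str:
        """Serialize the entry as the prompt prefix z fed to the frozen executor."""
        return (f"[Problem analysis]\n{self.problem_analysis}\n\n"
                f"[Experience highlights]\n{self.experience_highlights}\n\n"
                f"[Reference plan]\n{self.reference_plan}\n\n")

# Example: z is prepended to the original problem, and the executor runs unchanged.
entry = ExperienceEntry(
    problem_analysis="Multi-step rate problem; unit conversion is the main pitfall.",
    experience_highlights="Convert minutes to hours before combining rates.",
    reference_plan="1) Extract rates. 2) Normalize units. 3) Combine. 4) Verify.",
)
prompt = entry.render() + "Problem: ..."
```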
Training SEAM follows a forward‑learning loop that treats SEAM as a guidance policy whose quality is judged by the downstream executor’s success. The loop consists of three steps:
- Forward Exploration – For each training instance s, SEAM samples K candidate experience entries {z_j} from its current policy π_θ.
- Rollout-Based Evaluation – The frozen executor E_ϕ is conditioned on each candidate z_j and runs M stochastic rollouts, producing answers â_{j,m}. A binary reward is assigned: 1 if the answer is correct and the candidate z_j satisfies the schema (i.e., is structurally complete), otherwise 0. The average reward R̄_j across the M rollouts quantifies the utility of candidate z_j.
- Parametric Library Evolution (GRPO) – Using the group of average rewards {R̄_j}, SEAM computes group-relative advantages A_j = (R̄_j − mean{R̄_j}) / (std{R̄_j} + δ), where δ is a small constant for numerical stability. A PPO-style clipped objective L_GRPO(θ) combines the advantage-weighted likelihood ratio with a KL penalty that keeps the updated policy close to a fixed reference policy (the initial SEAM). Crucially, gradients flow only through SEAM; the executor E_ϕ remains frozen throughout training. This decoupled optimization lets SEAM improve continuously from execution feedback without risking catastrophic forgetting in the executor.
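The reward and advantage computation in the loop above can be sketched numerically. This is a minimal illustration of the assumed form A_j = (R̄_j − mean) / (std + δ); the rollout outcomes below are made up, and the function names are not from the paper.

```python
import statistics

def rollout_rewards(correct_flags, schema_ok):
    """Binary reward per rollout: 1 iff the answer is correct AND
    the candidate entry satisfies the schema; otherwise 0."""
    return [1.0 if (c and schema_ok) else 0.0 for c in correct_flags]

def group_relative_advantages(mean_rewards, delta=1e-4):
    """Group-relative advantage: center each candidate's average reward on the
    group mean and scale by the group's standard deviation (plus small delta)."""
    mu = statistics.fmean(mean_rewards)
    sigma = statistics.pstdev(mean_rewards)
    return [(r - mu) / (sigma + delta) for r in mean_rewards]

# K = 4 candidate entries, M = 3 executor rollouts each (invented outcomes):
per_candidate = [
    rollout_rewards([True, True, False], schema_ok=True),    # R̄ = 2/3
    rollout_rewards([False, False, False], schema_ok=True),  # R̄ = 0
    rollout_rewards([True, True, True], schema_ok=False),    # schema fails -> R̄ = 0
    rollout_rewards([True, True, True], schema_ok=True),     # R̄ = 1
]
mean_r = [statistics.fmean(rs) for rs in per_candidate]
adv = group_relative_advantages(mean_r)
```

Note how the third candidate is zeroed out despite three correct answers: the schema constraint gates the reward, so entries that leak answers or drop required parts cannot be reinforced.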
An optional post‑deployment phase logs successful (s, z*) pairs—instances where the frozen executor solved the problem correctly under SEAM’s guidance. Periodically, SEAM undergoes supervised fine‑tuning (SFT) on this buffer using teacher‑forcing, further internalizing concrete successful experience without touching the executor.
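The post-deployment logging described above can be sketched as a simple success buffer. The class below is a hypothetical stand-in: the paper does not specify buffer capacity, eviction, or batching, and the actual SFT update (teacher-forced cross-entropy on SEAM only) is elided.

```python
class ExperienceBuffer:
    """Hypothetical buffer of (s, z*) pairs where the frozen executor
    solved the problem correctly under SEAM's guidance."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.pairs = []  # list of (problem, experience_entry) tuples

    def log_success(self, problem, entry, executor_correct):
        """Log only successful trajectories; failures are discarded."""
        if executor_correct:
            self.pairs.append((problem, entry))
            self.pairs = self.pairs[-self.capacity:]  # FIFO eviction at capacity

    def sample_batch(self, batch_size):
        """Most-recent batch for the periodic SFT step (selection policy assumed)."""
        return self.pairs[-batch_size:]

# Usage: successes are kept, failures dropped, oldest entries evicted.
buf = ExperienceBuffer(capacity=3)
buf.log_success("p1", "z1", executor_correct=True)
buf.log_success("p2", "z2", executor_correct=False)  # discarded
for i in range(3, 6):
    buf.log_success(f"p{i}", f"z{i}", executor_correct=True)
```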
Experiments are conducted on four mathematical reasoning benchmarks (GSM8K, MATH, AIME‑24, AIME‑25) and on four cross‑domain tasks (CodeContests, MBPP, HotpotQA, Natural Questions). Baselines span three families: (i) the original frozen executor without any augmentation; (ii) direct training of the executor with GRPO (i.e., baking experience into the executor’s weights); (iii) RAG‑style methods such as MEM‑0 and Dynamic‑Cheatsheet. Across all settings, SEAM consistently yields 2–5 percentage‑point gains in pass@1 accuracy over the best baseline, while incurring negligible inference overhead (≈0.1–0.3 GFLOP, < 50 ms latency increase). Ablation studies demonstrate that (a) the schema‑constrained generation is essential—free‑form text leads to lower utility; (b) the group‑relative advantage formulation outperforms plain REINFORCE; (c) executor‑specific SEAM modules outperform a shared SEAM across multiple executors, confirming the benefit of tailoring experience to the solver’s inductive biases.
Key insights:
- Encoding experience in model parameters eliminates the need for external storage, indexing, and similarity‑based retrieval, thereby removing a major source of latency and engineering complexity.
- Training SEAM with the executor’s actual success as the reward aligns the generated guidance with true utility, overcoming the “similarity ≠ usefulness” problem of conventional RAG.
- The three‑part schema forces the generated experience to be diagnostic, prescriptive, and procedural, which makes it safe (doesn’t leak answers) and broadly applicable across problem instances.
- Keeping the executor frozen preserves its general capabilities and stability, while SEAM can be hot‑swapped or continually updated via logged SFT, enabling lifelong learning without risking regressions in the base model.
In summary, SEAM introduces a novel paradigm for experience reuse in LLM systems: a lightweight, trainable adapter that internalizes a structured experience library, generates utility‑optimized guidance in a single forward pass, and steers a frozen executor toward higher accuracy with minimal computational cost. This work opens a path toward scalable, low‑latency, and continuously improvable LLM deployments that can learn from their own past successes without the burdens of external memory management.