From Deferral to Learning: Online In-Context Knowledge Distillation for LLM Cascades

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Standard LLM cascades improve efficiency by deferring difficult queries from weak to strong models. However, these systems are typically static: when faced with repeated or semantically similar queries, they redundantly consult the expensive model, failing to adapt during inference. To address this, we propose Inter-Cascade, an online, interactive framework that transforms the strong model from a temporary helper into a long-term teacher. In our approach, when the strong model resolves a deferred query, it generates a generalized, reusable problem-solving strategy. These strategies are stored in a dynamic repository and retrieved via similarity matching to augment the weak model’s context for future queries. This enables the weak model to learn on the job without expensive parameter fine-tuning. We theoretically show that this mechanism improves the weak model’s confidence calibration. Empirically, Inter-Cascade outperforms standard cascades on multiple benchmarks, improving weak-model accuracy by up to 33.06 percent and overall system accuracy by up to 6.35 percent, while reducing strong-model calls by up to 48.05 percent and fees by up to 49.63 percent. Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a general, scalable framework applicable to both open-source and API-based LLMs.


💡 Research Summary

The paper addresses a fundamental inefficiency in existing large language model (LLM) cascades: they are “memory‑less.” In a typical cascade, a cheap weak model handles easy queries and defers only those whose confidence score falls below a threshold λ to a more capable but expensive strong model. After the strong model produces an answer, its reasoning process is discarded, so when similar or repeated queries appear later the system again invokes the strong model, wasting compute and cost. This problem is especially acute in real‑world workloads that exhibit a “similarity phenomenon,” where many queries are variations of the same underlying task (e.g., math problems with slight parameter changes).
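The baseline deferral rule can be sketched in a few lines. This is a minimal illustration of the memory-less cascade described above, not the paper's implementation; `LAMBDA` and the model callables are hypothetical stand-ins.

```python
LAMBDA = 0.8  # confidence threshold (illustrative value)

def cascade(query, weak_model, strong_model):
    """Baseline two-model cascade: the weak model answers only when confident.

    weak_model(query) -> (answer, confidence); strong_model(query) -> answer.
    """
    answer, confidence = weak_model(query)
    if confidence >= LAMBDA:
        return answer             # cheap path: weak model answers locally
    return strong_model(query)    # expensive path; the strong model's
                                  # reasoning is then simply discarded
```

The final comment is the crux of the paper's critique: nothing from the strong model's answer survives to help with the next similar query.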

Inter‑Cascade proposes an online, interactive framework that turns the strong model into a long‑term teacher. When the strong model resolves a deferred query, it simultaneously generates a strategy – a token sequence that includes the original query, the correct answer, and a generalized problem‑solving outline (key ideas, steps, or heuristics) that can be applied to semantically similar future queries. Each (query, strategy) pair is stored in a dynamic Strategy Repository (Repo). For any incoming query, the weak model first retrieves the top‑k most similar strategies from Repo using a similarity function (e.g., cosine similarity over embeddings). These retrieved strategies are concatenated with the query to form an augmented prompt, which is then fed to the weak model’s confidence estimator and generator. If the weak model’s confidence now exceeds λ, it answers locally; otherwise the query is still forwarded to the strong model, which again produces a new strategy that is added to Repo.
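The loop above can be sketched as follows. This is a hedged toy version of the described pipeline: the embedding function, repository layout, and thresholds (`LAMBDA`, `TOP_K`) are illustrative assumptions, not the authors' code, and a real system would use a proper sentence encoder.

```python
import numpy as np

LAMBDA, TOP_K = 0.8, 2
repo = []  # dynamic Strategy Repository: (query_embedding, strategy_text) pairs

def embed(text: str) -> np.ndarray:
    # Placeholder unit-norm embedding; a real system would use a sentence encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def retrieve(query_vec: np.ndarray, k: int = TOP_K) -> list:
    """Return the top-k strategies by cosine similarity (vectors are unit-norm)."""
    if not repo:
        return []
    sims = [float(vec @ query_vec) for vec, _ in repo]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
    return [repo[i][1] for i in order]

def answer(query, weak_model, strong_model):
    q_vec = embed(query)
    strategies = retrieve(q_vec)
    prompt = "\n".join(strategies + [query])   # strategy-augmented prompt
    ans, confidence = weak_model(prompt)
    if confidence >= LAMBDA:
        return ans                             # answered locally
    ans, strategy = strong_model(query)        # defer; teacher also emits a strategy
    repo.append((q_vec, strategy))             # repository grows online
    return ans
```

The key difference from a plain cascade is the last two lines of `answer`: every deferral enriches the repository, so future similar queries arrive at the weak model with relevant strategies already in context.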

The key technical contributions are:

  1. Generalizable Knowledge Transfer: Unlike simple caching of exact answers, strategies capture abstract reasoning patterns, enabling reuse across variations of a task.
  2. Theoretical Guarantees: The authors extend the calibration framework of Jung et al. (2025). They prove that, under reasonable assumptions (the number of queries passing the confidence threshold grows by a factor b ≥ 1 and the error rate shrinks by a factor ε ∈ (0,1]), the risk tolerance α in the calibrated guarantee strictly decreases. Theorem 2.2 and its corollary quantify this reduction using normal approximations.
  3. Negligible Overhead: Storing strategies is lightweight (a few hundred tokens per entry). Retrieval over a million embeddings (384‑dimensional) requires only 0.2–0.8 ms and modest GPU/CPU memory (≈70–80 MB VRAM, 80–100 MB RAM), making the approach feasible on commodity hardware and even mobile devices.
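The direction of the calibration result (item 2) can be illustrated numerically. The sketch below is a toy one-sided normal-approximation bound on the weak model's true error rate, not the paper's exact guarantee; the values of `b`, `eps`, the baseline error rate, and the z-score are illustrative assumptions. If `b` times more queries pass the threshold and the empirical error rate shrinks by a factor `eps`, both terms of the bound shrink, so the achievable risk tolerance α decreases.

```python
import math

def risk_upper_bound(p_hat: float, n: float, z: float = 1.645) -> float:
    """One-sided normal-approximation upper bound on the true error rate,
    given empirical error rate p_hat over n locally answered queries."""
    return p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)

# Baseline cascade: the weak model answers n = 1000 queries with 10% error.
alpha_before = risk_upper_bound(0.10, 1000)

# With Inter-Cascade (toy factors): b = 1.5x more queries clear the
# threshold, and the error rate shrinks by a factor eps = 0.7.
b, eps = 1.5, 0.7
alpha_after = risk_upper_bound(eps * 0.10, b * 1000)

print(f"alpha before: {alpha_before:.4f}, after: {alpha_after:.4f}")
```

Both effects push in the same direction: a lower error rate shrinks the first term, and a larger sample shrinks the confidence-interval term, matching the theorem's claim that α strictly decreases.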

Empirically, the authors evaluate Inter‑Cascade on eight diverse benchmarks, focusing on four representative datasets spanning reasoning‑intensive tasks (e.g., GSM‑Symbolic, MathQA) and knowledge‑heavy QA (e.g., HotpotQA). Using a two‑model cascade (weak M₁, strong M₂), they compare against the state‑of‑the‑art cascade of Jung et al. (2025). Results show:

  • Weak‑model accuracy improves by up to 33.06 %.
  • Overall system accuracy (including strong‑model answers) improves by up to 6.35 %.
  • Calls to the strong model drop by up to 48.05 %, translating into ≈49.63 % cost savings.

The gains are most pronounced on tasks with high query similarity, where the strategy repository quickly accumulates useful patterns, allowing the weak model to solve increasingly complex instances without deferral.

In summary, Inter‑Cascade introduces a scalable, online in‑context knowledge distillation mechanism that converts a strong LLM into a continual teacher. By automatically generating and reusing abstract problem‑solving strategies, it enhances the weak model’s confidence calibration, reduces reliance on expensive inference, and does so without any parameter fine‑tuning. The framework is model‑agnostic and works for both open‑source and API‑based LLMs, offering a practical path toward cost‑effective, high‑performance language‑model services.

