Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching
Large Language Model (LLM) agents tackle data-intensive tasks such as deep research and code generation. However, their effectiveness depends on frequent interactions with knowledge sources across remote clouds or regions. Such interactions can create non-trivial latency and cost bottlenecks. Existing caching solutions focus on exact-match queries, limiting their effectiveness for semantic knowledge reuse. To address this challenge, we introduce Cortex, a novel cross-region knowledge caching architecture for LLM agents. At its core are two abstractions: Semantic Element (SE) and Semantic Retrieval Index (Seri). A semantic element captures the semantic embedding representation of an LLM query together with performance-aware metadata such as latency, cost, and staticity. Seri then provides two-stage retrieval: a vector-similarity index over semantic embeddings for fast candidate selection and a lightweight LLM-powered semantic judger for precise validation. Atop these primitives, Cortex builds a new cache interface that includes a new semantic-aware cache hit definition, a cost-efficient eviction policy, and proactive prefetching. To reduce overhead, Cortex co-locates the small LLM judger with the main LLM using adaptive scheduling and resource sharing. Our evaluation demonstrates that Cortex delivers substantial performance improvements without compromising correctness. On representative search workloads, Cortex achieves up to a 3.6x increase in throughput while maintaining cache hit rates of over 85% and preserving accuracy virtually identical to non-cached baselines. Cortex also improves throughput for coding tasks by 20%, showcasing its versatility across diverse agentic workloads.
💡 Research Summary
The paper addresses a critical bottleneck in large‑language‑model (LLM) agents: the high latency and monetary cost incurred when these agents repeatedly call external knowledge sources (search APIs, private RAG back‑ends, etc.) across data‑center regions. Existing caching mechanisms either focus on exact key‑value matches (traditional KV caches, file‑system caches) or on reusing LLM outputs via prompt similarity (semantic prompt caches). Both approaches fall short for agentic workloads because queries are naturally expressed in language, often semantically similar but not textually identical, and because agents must respect cost, latency, and rate‑limit constraints of remote tools.
Cortex introduces a novel “semantic‑aware remote knowledge caching” paradigm built around two abstractions: Semantic Element (SE) and Semantic Retrieval Index (Seri). An SE packages the embedding of an agent’s query together with performance‑aware metadata such as observed latency, monetary cost, and staticity (how frequently the underlying data changes). This metadata enables the cache to make cost‑effective decisions, e.g., preferring to retain highly static, expensive‑to‑refetch items while evicting volatile ones whose reuse would risk staleness.
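To make the SE abstraction concrete, the following is a minimal sketch of how such an element might be represented; the field names and types are illustrative assumptions, not Cortex's actual data layout.

```python
from dataclasses import dataclass, field
import time

@dataclass
class SemanticElement:
    """Hypothetical sketch of a Semantic Element (SE): a query embedding
    plus the performance-aware metadata the paper describes."""
    embedding: list          # semantic embedding of the agent's query
    payload: bytes           # cached response from the remote knowledge source
    latency_ms: float        # observed fetch latency of the remote call
    cost_usd: float          # observed monetary cost of the remote call
    staticity: float         # 0.0 (volatile) .. 1.0 (underlying data rarely changes)
    last_access: float = field(default_factory=time.time)  # for recency-aware eviction
```

The metadata fields are what distinguish an SE from a plain vector-store entry: they let the cache reason about whether serving a semantically similar item is actually worthwhile.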
Seri implements a two‑stage retrieval pipeline. Stage 1 uses an Approximate Nearest Neighbor (ANN) index to quickly retrieve a high‑recall set of candidate SEs based on embedding similarity. Stage 2 passes these candidates to a lightweight LLM‑powered “Semantic Judge” that validates whether the candidate truly satisfies the current context. The judge is deliberately small (a few hundred million parameters) and runs with a constrained prompt, keeping its inference latency to 1‑2 ms while dramatically reducing the false positives that plague pure vector‑similarity approaches.
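The two-stage flow can be sketched as follows. This is a simplified stand-in, not Seri's implementation: brute-force cosine ranking substitutes for a real ANN index, and the judge is abstracted as a caller-supplied predicate standing in for the small LLM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb, index, k=4, judge=None):
    """Stage 1: rank stored SEs by embedding similarity (brute force here,
    where Seri would use an ANN index) and take the top-k candidates.
    Stage 2: validate each candidate with the semantic judge; only a
    judge-approved candidate counts as a hit."""
    ranked = sorted(index, key=lambda se: cosine(query_emb, se["emb"]), reverse=True)
    for cand in ranked[:k]:
        if judge is None or judge(query_emb, cand):
            return cand
    return None  # cache miss: fall through to the remote knowledge source
```

The key design point this illustrates is that Stage 1 is tuned for recall (cheap, approximate) while Stage 2 is tuned for precision (expensive per candidate, but only run on the short list).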
Cortex redefines a cache hit: a hit occurs only when a candidate passes the semantic judge and its metadata meets the current cost/latency/staticity thresholds. This prevents stale or expensive data from being served simply because it is semantically close. The eviction policy blends Least‑Recently‑Used (LRU) with a cost‑aware scoring function that weights staticity and observed API cost, ensuring that high‑value items stay longer. Moreover, Cortex exploits the Zipf‑like popularity distribution and bursty, correlated query patterns observed in both web‑search‑driven agents and code‑generation agents. By analyzing SE access frequencies and temporal spikes, Cortex proactively prefetches likely‑to‑be‑needed SEs, further reducing remote calls during burst periods.
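The cost-aware eviction idea above can be sketched with a simple scoring function. The blend of recency with a staticity-and-cost value term follows the paper's description, but the exact formula and weights here are illustrative assumptions.

```python
import time

def eviction_score(se, now=None, w_recency=0.5, w_value=0.5):
    """Lower score = evict first. Blends LRU-style recency with a
    cost-aware value term (staticity * observed API cost), so that
    static, expensive-to-refetch items stay in the cache longer.
    The weights are hypothetical, not Cortex's tuned values."""
    now = time.time() if now is None else now
    recency = 1.0 / (1.0 + (now - se["last_access"]))
    value = se["staticity"] * se["cost_usd"]
    return w_recency * recency + w_value * value

def pick_victim(cache, now=None):
    """Choose the entry to evict: the one with the lowest blended score."""
    return min(cache, key=lambda se: eviction_score(se, now))
```

Under this scoring, a recently used, highly static, costly item outscores a stale, volatile, cheap one, so the latter is evicted first.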
Implementation-wise, Cortex colocates the main agent LLM and the semantic judge on the same GPU, using an adaptive priority scheduler that shields the agent’s critical inference path from judge interference. This co‑location avoids extra network hops and keeps additional GPU memory overhead modest.
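The priority-shielding idea can be illustrated with a toy scheduler. This is a deliberately simplified sketch of the scheduling policy only; real co-location would also manage GPU memory partitioning and batching, and the class and constants here are hypothetical.

```python
import heapq

AGENT, JUDGE = 0, 1  # lower value = higher priority

class AdaptiveScheduler:
    """Toy priority scheduler for a shared GPU: agent-inference requests
    always dequeue before semantic-judge requests, shielding the agent's
    critical inference path from judge interference."""
    def __init__(self):
        self._queue = []
        self._seq = 0  # FIFO tiebreaker within the same priority class

    def submit(self, kind, task):
        heapq.heappush(self._queue, (kind, self._seq, task))
        self._seq += 1

    def next_task(self):
        return heapq.heappop(self._queue)[2] if self._queue else None
```

Because the judge only runs in the gaps left by agent inference, its added latency stays off the critical path, which is what makes co-locating the two models on one GPU attractive.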
The evaluation covers two representative workloads: a search‑oriented agent (Search‑R1) and a coding agent operating on the SWE‑Bench suite. On the search workload, Cortex achieves up to 3.6× higher throughput compared with a naïve exact‑match cache, while maintaining >85 % hit rates and virtually identical answer accuracy to a non‑cached baseline. The semantic judge eliminates the accuracy drop seen in prior semantic caches. For coding tasks, Cortex delivers a 20 % throughput boost, confirming that semantic matching of file‑level requests (e.g., “load the SQL parser”) is effective. Cost simulations show that reducing remote API calls can save millions of dollars per month for large‑scale deployments, and the adaptive eviction/pre‑fetch mechanisms help respect API rate limits that would otherwise throttle agents.
In summary, Cortex demonstrates that a cache built on semantic similarity, validated by a lightweight LLM, and guided by cost/latency metadata can dramatically alleviate the latency and cost challenges of remote data access in LLM agents without sacrificing correctness. The paper opens avenues for future work on multi‑model cooperation, dynamic metadata learning, and automated placement strategies across heterogeneous cloud environments.