Deterministic Retrieval at Scale: Optimal-Space LCP Indexing and 308x Energy Reduction on Modern GPUs
We study deterministic top-k retrieval under Longest Common Prefix (LCP) similarity for N sequences of length L. We prove a tight Omega(N) space lower bound (cell-probe model) and present a trie-based index using O(N*L) space with O(L+k) query time. We contrast this with pairwise materialization (Theta(N^2)), which hits a practical OOM wall at scale, while our indexed approach remains O(N) in memory. We then introduce Thermal-Aware Logic (TAL), which turns prefix structure into range-bounded scans. In hardware measurements, TAL reduces energy per query by 308x (0.0145 J vs 4.46 J) and cuts p95 latency by 329x (0.114 ms vs 37.5 ms) on a 20M-item range-scan benchmark, while sustaining near-peak utilization (~99%) under long runs. The result is a deterministic retrieval primitive with receipts in regimes where approximate methods are unacceptable.
💡 Research Summary
The paper tackles deterministic top‑k retrieval under the Longest Common Prefix (LCP) similarity metric for a collection of N fixed‑length sequences (each of length L). The authors first establish a tight Ω(N) space lower bound in the cell‑probe model, showing that any data structure supporting top‑k LCP queries must store at least linear‑in‑N cells (or Ω(N·L·log σ) bits, where σ is the alphabet size). They then present a trie‑based index that meets this bound up to a factor of L: the index occupies O(N·L) space (each sequence contributes at most L nodes) and answers a query in O(L + k) time, where k is the number of results requested. The query algorithm descends the trie following the query string until the longest common prefix is exhausted, then performs a breadth‑first search from that node, collecting up to k items. Because the trie construction and traversal are fully deterministic (children are iterated in a fixed sorted order, and ties are broken by original index), the same (S, q, k) triple always yields identical results, satisfying strict determinism requirements for safety‑critical systems.
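The descend-then-BFS query described above can be sketched in a few dozen lines. This is an illustrative reconstruction, not the authors' implementation: the names `TrieNode` and `LCPIndex` are assumptions, and a strict O(L + k) bound would need a more careful frontier than this plain BFS. Sorted child iteration and index-order tie-breaking give the determinism the summary describes.

```python
from collections import deque

class TrieNode:
    __slots__ = ("children", "indices")
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.indices = []   # original indices of sequences ending at this node

class LCPIndex:
    """Illustrative trie index: O(N*L) space, deterministic top-k LCP queries."""
    def __init__(self, sequences):
        self.root = TrieNode()
        for i, s in enumerate(sequences):
            node = self.root
            for ch in s:
                node = node.children.setdefault(ch, TrieNode())
            node.indices.append(i)

    def topk(self, q, k):
        # Descend the trie while the query prefix still matches.
        node, path = self.root, [self.root]
        for ch in q:
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node = nxt
            path.append(node)
        out, seen = [], set()
        # Collect from the deepest matching node first (largest LCP),
        # falling back toward the root; sorted children keep BFS deterministic.
        for start in reversed(path):
            queue = deque([start])
            while queue and len(out) < k:
                cur = queue.popleft()
                for i in cur.indices:
                    if i not in seen and len(out) < k:
                        seen.add(i)
                        out.append(i)
                for ch in sorted(cur.children):
                    queue.append(cur.children[ch])
            if len(out) == k:
                break
        return out
```

Running the same (S, q, k) triple twice returns byte-identical result lists, which is the determinism property the paper requires.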
Beyond the algorithmic contribution, the authors introduce Thermal‑Aware Logic (TAL), a hardware‑aware execution strategy that exploits the prefix structure of the trie to limit the amount of data scanned per query. By partitioning the dataset into σ^d “prefix buckets” (where d is a chosen prefix length) and sorting items within each bucket, a query only scans the bucket that matches its prefix, reducing the work per query by a factor of B = σ^d. Theoretical analysis shows that energy consumption scales inversely with B, and the authors connect this reduction to the Landauer limit, arguing that algorithmic pruning narrows the gap between practical energy use and the fundamental thermodynamic minimum.
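The bucketing idea behind TAL can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the paper's GPU kernel is not reproduced, and uniform key distribution is assumed so that each of the B = σ^d buckets holds roughly N/B items.

```python
def build_buckets(items, d):
    """Partition items into sigma**d prefix buckets; sort within each bucket."""
    buckets = {}
    for s in items:
        buckets.setdefault(s[:d], []).append(s)
    for b in buckets.values():
        b.sort()  # sorted order enables range-bounded scans inside a bucket
    return buckets

def tal_query(buckets, q, d):
    # Only the single bucket matching q's length-d prefix is scanned,
    # so per-query work (and, per the paper's model, energy) drops by ~B.
    return buckets.get(q[:d], [])
```

With a binary alphabet and d = 1, for example, a query touches only the half of the data sharing its first symbol; the 308x energy figure comes from much larger effective B on the 20M-item benchmark.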
The experimental evaluation is performed on an NVIDIA H100 GPU (80 GB HBM3). The authors first characterize the out‑of‑memory (OOM) wall for pairwise similarity materialization: at N ≈ 500 k (L = 256) the materialized N × N matrix would require ~466 GiB, exceeding the GPU’s memory, whereas the trie index occupies only ~205 MiB (≈ 0.2 GB). Sustained load tests run for 20 minutes at 266 queries per second, achieving 98.96 % GPU utilization and stable temperature. In a realistic guidance‑navigation‑control (GNC) scenario with 1 000+ sensors operating at >4 kHz, the system processes each step in 0.018 ms using 0.013 J of energy. The TAL technique reduces per‑query energy from 4.46 J (full‑scan baseline) to 0.0145 J—a 308× improvement—and cuts the 95th‑percentile latency from 37.5 ms to 0.114 ms (329× faster). Additional benchmarks on large‑scale inference serving (2 M candidates) and multi‑agent coordination confirm near‑peak GPU utilization (≈99 %) and deterministic behavior.
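The OOM-wall arithmetic checks out under one assumption the summary leaves implicit: 2-byte (fp16) similarity entries, which is assumed here rather than stated in the text.

```python
N = 500_000            # sequences
BYTES_PER_ENTRY = 2    # assumed fp16 similarity entries (dtype not stated in the summary)
L = 256                # sequence length

matrix_bytes = N * N * BYTES_PER_ENTRY        # pairwise N x N materialization
print(f"{matrix_bytes / 2**30:.1f} GiB")      # ~465.7 GiB, matching the quoted ~466 GiB

trie_bytes = N * L                            # raw character payload of the trie
print(f"{trie_bytes / 2**20:.1f} MiB")        # ~122 MiB raw; the quoted ~205 MiB adds node overhead
```

The materialized matrix thus exceeds the H100's 80 GB of HBM3 by roughly 6x, while the index stays three orders of magnitude below it.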
The paper also discusses conditional hardness: assuming the Orthogonal Vectors Hypothesis (OVH), any algorithm that preprocesses the data in polynomial time cannot answer top‑1 LCP queries in sub‑linear (O(N^{1‑ε})) time, implying that the O(L + k) query bound is essentially optimal for the general case. The authors note that the choice of prefix length d for TAL is data‑distribution dependent; non‑uniform datasets may lead to imbalanced bucket sizes, suggesting future work on adaptive bucket sizing and incremental updates.
In summary, the work delivers a complete stack—from information‑theoretic lower bounds, through optimal‑space trie indexing, to a GPU‑tailored energy‑saving execution model—that enables deterministic, scalable, and energy‑efficient LCP‑based retrieval. This addresses a critical gap for safety‑critical AI systems where approximate, probabilistic methods are unacceptable, and demonstrates that careful algorithm‑hardware co‑design can achieve orders‑of‑magnitude gains in both performance and power consumption.