Dynamic Superblock Pruning for Fast Learned Sparse Retrieval
This paper proposes superblock pruning (SP) during top-k online document retrieval for learned sparse representations. SP structures the sparse index as a set of superblocks on a sequence of document blocks and conducts a superblock-level selection to decide if some superblocks can be pruned before visiting their child blocks. SP generalizes the previous flat block or cluster-based pruning, allowing the early detection of groups of documents that cannot or are less likely to appear in the final top-k list. SP can accelerate sparse retrieval in a rank-safe or approximate manner under a high-relevance competitiveness constraint. Our experiments show that the proposed scheme significantly outperforms state-of-the-art baselines on MS MARCO passages on a single-threaded CPU.
💡 Research Summary
The paper introduces a novel dynamic pruning technique called Superblock Pruning (SP) for learned sparse retrieval, aiming to accelerate top‑k document ranking while preserving relevance guarantees. Traditional dynamic pruning methods operate at the level of individual postings or flat document blocks, computing an upper bound on each block’s score and discarding blocks whose bound falls below the current top‑k threshold. While effective, these approaches still require traversing many blocks, especially when block sizes are small, limiting their speed gains.
SP extends this idea by adding a second hierarchical level: superblocks. The document collection is first partitioned into uniform blocks of b documents (e.g., b = 8 or 16), and consecutive blocks are then grouped into superblocks of c blocks each (c = 64 in the experiments). For each block and superblock, the maximum weight of every vocabulary term t is pre‑computed; each superblock additionally stores the average of its child blocks' maxima. These statistics are stored compactly (8‑bit for maxima, 16‑bit for averages), adding roughly 1–2 GB to a 37 GB inverted index.
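The pre‑computation step above can be sketched as follows. This is an illustrative Python/NumPy toy (the paper's implementation is in Rust); the dense `weights` matrix, the function name, and the linear quantization scheme are assumptions made for clarity, not the authors' exact code.

```python
import numpy as np

def build_superblock_stats(weights, b=8, c=64):
    """Precompute per-block and per-superblock term-weight summaries.

    weights: dense (num_docs, vocab) array of learned sparse term
             weights (a toy stand-in for the real inverted index).
    b: documents per block; c: blocks per superblock.
    Returns maxima quantized to 8 bits and superblock averages
    quantized to 16 bits, mirroring the storage scheme described above.
    """
    num_docs, vocab = weights.shape
    num_blocks = (num_docs + b - 1) // b
    # Per-block maximum weight for every vocabulary term.
    block_max = np.zeros((num_blocks, vocab), dtype=np.float32)
    for i in range(num_blocks):
        block_max[i] = weights[i * b:(i + 1) * b].max(axis=0)
    num_sb = (num_blocks + c - 1) // c
    sb_max = np.zeros((num_sb, vocab), dtype=np.float32)
    sb_avg = np.zeros((num_sb, vocab), dtype=np.float32)
    for j in range(num_sb):
        child = block_max[j * c:(j + 1) * c]
        sb_max[j] = child.max(axis=0)   # max over child blocks' maxima
        sb_avg[j] = child.mean(axis=0)  # average of child blocks' maxima
    # Assumed linear quantization: 8-bit for maxima, 16-bit for averages.
    scale = max(float(block_max.max()), 1e-9)
    q_block_max = np.round(block_max / scale * 255).astype(np.uint8)
    q_sb_max = np.round(sb_max / scale * 255).astype(np.uint8)
    q_sb_avg = np.round(sb_avg / scale * 65535).astype(np.uint16)
    return q_block_max, q_sb_max, q_sb_avg, scale
```

Because quantized bounds must remain upper bounds, a production version would round maxima up rather than to the nearest value; the sketch glosses over this detail.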
During query processing, SP proceeds in three stages:
- Superblock‑level pruning – For a query Q, SP computes two bounds for each superblock X: a maximum bound S_max(X) = ∑_{t∈Q} q_t·W_{X,t} and an average bound S_avg(X) = ∑_{t∈Q} q_t·W̄_{X,t}, where q_t is the query weight of term t, W_{X,t} is the maximum weight of t over X, and W̄_{X,t} is the average of the child blocks' maxima. Two user‑tunable parameters μ and η (0 < μ ≤ η ≤ 1) define the pruning thresholds: a superblock is discarded if S_max(X) ≤ μ·θ or S_avg(X) ≤ η·θ, where θ is the current top‑k heap threshold.
- Block‑level pruning – Within each surviving superblock, every constituent block B is tested against the classic block bound: B is pruned when BoundSum(B) ≤ η·θ, without scoring any of its documents.
- Document scoring – Only the unpruned blocks are traversed, and their documents are scored using a forward‑index approach identical to BMP. The final top‑k list is assembled from these scores.
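The three stages can be sketched end to end as a toy Python function. This is a simplified re‑implementation under assumptions (unquantized float statistics, a dict‑based forward index, and the function name `sp_topk`), not the authors' Rust code; with μ = η = 1 the procedure is rank‑safe, and smaller values prune more aggressively.

```python
import heapq
import numpy as np

def sp_topk(query, block_max, sb_max, sb_avg, fwd_index, c, k=10,
            mu=1.0, eta=1.0):
    """Toy sketch of SP's three-stage top-k retrieval.

    query: {term_id: weight}. block_max / sb_max / sb_avg: float arrays
    of shape (num_blocks, vocab) and (num_superblocks, vocab).
    fwd_index: block_id -> list of (doc_id, {term_id: weight}).
    """
    terms = np.fromiter(query.keys(), dtype=np.int64)
    qw = np.fromiter(query.values(), dtype=np.float32)
    s_max = sb_max[:, terms] @ qw          # S_max(X) for all superblocks
    s_avg = sb_avg[:, terms] @ qw          # S_avg(X) for all superblocks
    heap, theta = [], 0.0                  # min-heap of current top-k
    for sb in np.argsort(-s_max):          # most promising superblocks first
        if s_max[sb] <= mu * theta or s_avg[sb] <= eta * theta:
            continue                       # Stage 1: drop whole superblock
        for blk in range(sb * c, min((sb + 1) * c, block_max.shape[0])):
            if block_max[blk, terms] @ qw <= eta * theta:
                continue                   # Stage 2: drop this block
            for doc, w in fwd_index.get(blk, []):   # Stage 3: score docs
                score = sum(qv * w.get(t, 0.0) for t, qv in query.items())
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc))
                if len(heap) == k:
                    theta = heap[0][0]     # raise the pruning threshold
    return sorted(heap, reverse=True)
```

Visiting superblocks in decreasing S_max order (as in the sketch) grows the threshold θ quickly, which is what makes the later stage‑1 and stage‑2 tests effective.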
A key engineering contribution is the cache‑friendly computation of these bounds. The authors compare two accumulation strategies: (a) term‑at‑a‑time, which repeatedly accesses scattered superblock metadata, and (b) superblock‑at‑a‑time, which processes all query terms for one superblock before moving to the next. The latter yields up to 1.89× faster bound computation thanks to better L1 cache reuse and SIMD vectorization, and is the default in SP.
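The two accumulation orders can be contrasted in a short sketch. This Python/NumPy version only illustrates the access pattern; the locality and SIMD benefits the paper measures arise in the compiled Rust implementation, and the function names here are invented for illustration.

```python
import numpy as np

def bounds_term_at_a_time(sb_max, terms, qw):
    """Option (a): loop over query terms, scattering partial sums across
    all superblock entries - poor locality once the metadata outgrows L1."""
    s = np.zeros(sb_max.shape[0], dtype=np.float32)
    for t, q in zip(terms, qw):
        s += q * sb_max[:, t]          # touches every superblock per term
    return s

def bounds_superblock_at_a_time(sb_max, terms, qw):
    """Option (b), SP's default: finish one superblock before moving on,
    so its metadata row stays cache-resident and vectorizes cleanly."""
    s = np.empty(sb_max.shape[0], dtype=np.float32)
    for x in range(sb_max.shape[0]):
        s[x] = sb_max[x, terms] @ qw   # one contiguous row per superblock
    return s
```

Both orders compute identical bounds; only the memory‑access pattern differs, which is why the choice matters for cache behavior rather than correctness.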
Theoretical analysis shows that SP satisfies a μ‑competitiveness property: the average score of the top‑k′ (k′ ≤ k) results returned by SP is at least μ times the average score of any rank‑safe algorithm. Under an i.i.d. assumption for document scores within a superblock, the η‑bound provides a probabilistic safety guarantee analogous to ASC’s probabilistic pruning, but without requiring random segmentation of blocks.
Experiments are conducted on the MS MARCO Passage dataset (8.8 M passages) using two learned sparse models: SPLADE and Efficient‑SPLADE. Baselines include BMP, ASC, Seismic, and MaxScore (PISA). All methods run on a single thread of an Intel i7‑1260P with AVX2 instructions; SP is compiled with Rust 1.84 in fully optimized release mode. The evaluation measures mean latency (ms), MRR@10, Recall@k (k = 10 and 1000), and nDCG@10.
Results demonstrate substantial speedups:
- For k = 10 at a 99 % recall budget, SP achieves 0.629 ms latency (MRR ≈ 37.7), compared to BMP’s 1.44 ms (MRR ≈ 38.1) – roughly a 2.3× speedup.
- For k = 1000 at the same recall budget, SP’s latency is 1.74 ms (Recall ≈ 97.9 %) versus BMP’s 4.99 ms (Recall ≈ 98.2 %) – a 2.9× speedup.
- Compared to ASC, SP is 3.3× faster for both k = 10 and k = 1000 under identical recall constraints.
- Against Seismic, which uses aggressive static index pruning, SP remains faster in the full‑index setting (e.g., 2.9× faster at 99 % recall) while preserving rank‑safety.
A detailed breakdown shows that superblock pruning accounts for roughly 30 % of total time, block pruning about 20 %, and actual document scoring the remaining 50 % for SP. In BMP, block pruning consumes a larger fraction, especially when block size b is reduced, leading to higher overhead. Varying b from 128 down to 8 demonstrates that SP’s two‑level pruning scales gracefully, whereas BMP’s performance degrades sharply for small b.
Memory overhead is modest: with c = 64, b = 8, the extra superblock metadata occupies ~2 GB, raising the total uncompressed index size from 37 GB (BMP) to 39 GB (SP). Seismic’s static pruning yields a smaller 13 GB index but sacrifices rank‑safety.
The authors acknowledge limitations. Superblock effectiveness depends on the quality of the underlying document clustering (Bipartite Partitioning). Poor clustering can inflate the superblock bound, reducing pruning opportunities. The probabilistic safety guarantee assumes i.i.d. scores within a superblock, which may not hold in practice. Moreover, the current implementation targets AVX2; leveraging AVX‑512 or newer prefetching techniques could further improve throughput.
In conclusion, Dynamic Superblock Pruning offers a practical and theoretically grounded enhancement to learned sparse retrieval. By introducing a two‑level pruning hierarchy, it reduces the number of block and document evaluations, achieves significant latency reductions on commodity CPUs, and maintains strong relevance guarantees. The work opens avenues for future research on adaptive clustering, tighter probabilistic models, and deeper hardware‑level optimizations to bring learned sparse retrieval closer to real‑time production use.