LEMUR: Learned Multi-Vector Retrieval

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding for each token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved recall of multi-vector retrieval comes at the expense of significantly increased latency. This necessitates designing efficient approximate nearest neighbor search (ANNS) algorithms for multi-vector search. In this work, we introduce LEMUR, a simple-yet-efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, which enables the use of existing single-vector ANNS methods for speeding up retrieval. In addition to performance evaluation on ColBERTv2 embeddings, we evaluate LEMUR on embeddings generated by modern multi-vector text models and multi-vector visual document retrieval models. LEMUR is an order of magnitude faster than earlier multi-vector similarity search methods.


💡 Research Summary

The paper introduces LEMUR (Learned Multi‑Vector Retrieval), a framework that dramatically speeds up retrieval with late‑interaction (multi‑vector) models such as ColBERTv2 while preserving their superior recall. Multi‑vector models encode each token of a query and a document as a separate embedding and compute similarity with the MaxSim (or Chamfer) measure: MaxSim(X, C) = ∑_{x∈X} max_{c∈C} ⟨x, c⟩. This fine‑grained representation yields higher accuracy than single‑vector approaches, but the token‑level inner‑product computation scales with the product of query and document token counts, leading to prohibitive latency for large corpora.
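The MaxSim measure above can be sketched in a few lines of numpy; the toy data and the `maxsim` name here are illustrative, not taken from the paper's code:

```python
import numpy as np

def maxsim(X, C):
    """Chamfer/MaxSim score: for each query token x in X, take its best
    inner product against the document tokens C, then sum over query tokens.
    X: (n_query_tokens, d), C: (n_doc_tokens, d)."""
    sims = X @ C.T                # (n_query_tokens, n_doc_tokens) inner products
    return sims.max(axis=1).sum() # best doc token per query token, summed

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))       # toy query: 4 token embeddings of dim 8
C = rng.normal(size=(6, 8))       # toy document: 6 token embeddings
score = maxsim(X, C)
```

Note the cost of the `X @ C.T` step: every query token is compared against every document token, which is exactly the pairwise computation that makes exhaustive multi‑vector search expensive at corpus scale.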

LEMUR tackles this problem in two successive reductions. First, the authors cast the task of estimating MaxSim scores for all documents as a supervised learning problem. For a corpus of m documents {C₁, …, C_m}, the target function f(X) = (MaxSim(X, C₁), …, MaxSim(X, C_m)) can be rewritten as a sum over token‑level contributions: f(X) = ∑_{x∈X} g(x), where g(x) = (max_{c∈C₁} ⟨c, x⟩, …, max_{c∈C_m} ⟨c, x⟩). Thus the learning problem reduces to a multi‑output regression on individual token embeddings. LEMUR implements g with a shallow neural network ϕ(x) = W ψ(x), where ψ: ℝ^d → ℝ^{d′} is a one‑hidden‑layer MLP (linear layer + GELU + layer‑norm) and W ∈ ℝ^{m×d′} is a linear output layer without bias. Training minimizes the mean‑squared error between ϕ(x) and the true g(x). Crucially, the output layer is linear, so the estimated MaxSim for a whole query X becomes f̂(X) ≈ W Ψ(X) with Ψ(X) = ∑_{x∈X} ψ(x) (the sum‑pooled token representations). This formulation allows the model to learn token‑level contributions while keeping inference cost low.
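The key linearity argument — summing per‑token outputs W ψ(x) equals projecting the pooled vector Ψ(X) once — can be checked numerically. This is a minimal numpy sketch with random weights; the layer sizes and the tanh‑approximated GELU are assumptions for illustration:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def layer_norm(z, eps=1e-5):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def psi(x, W1, b1):
    """One-hidden-layer MLP: linear -> GELU -> layer-norm, applied per token."""
    return layer_norm(gelu(x @ W1 + b1))

rng = np.random.default_rng(1)
d, d_latent, m = 8, 16, 5            # token dim, latent dim, toy corpus size
W1 = rng.normal(size=(d, d_latent))  # hidden layer of psi
b1 = rng.normal(size=d_latent)
W  = rng.normal(size=(m, d_latent))  # linear output layer, no bias

X = rng.normal(size=(4, d))          # query of 4 token embeddings

# Training view: apply the network per token and sum the m-dim outputs.
per_token = sum(W @ psi(x, W1, b1) for x in X)
# Inference view: pool the latent token vectors first, then project once.
pooled = W @ psi(X, W1, b1).sum(axis=0)
```

Because W is applied linearly and ψ acts on each token independently, the two views coincide, which is what lets inference collapse a whole query into a single d′‑dimensional vector.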

The second reduction exploits the linearity of the output layer. Each document j is represented by the row vector w_j of W, and a query is represented by the single vector Ψ(X). The estimated MaxSim scores are simply inner products ⟨w_j, Ψ(X)⟩. Consequently, finding the top‑k′ documents reduces to a maximum inner‑product search (MIPS) in the d′‑dimensional latent space. Existing highly optimized ANNS libraries (Faiss, ScaNN, LoRANN, etc.) can be used directly on the set {w_j}. After retrieving a candidate set of k′ documents (with k ≪ k′ ≪ m), LEMUR recomputes the exact MaxSim for those candidates to produce the final top‑k results, preserving the high recall of the original multi‑vector model.
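The two‑stage retrieve‑then‑rerank flow can be sketched as follows. Here an exhaustive `W @ q` stands in for a real ANNS/MIPS library, the latent query vector is random rather than produced by a trained ψ, and all names (`retrieve`, `k_prime`) are hypothetical:

```python
import numpy as np

def maxsim(X, C):
    """Exact MaxSim between query tokens X and document tokens C."""
    return float((X @ C.T).max(axis=1).sum())

def retrieve(query_tokens, psi_query, W, docs, k=3, k_prime=10):
    """Stage 1: MIPS over the rows of W against the latent query vector
    (brute force here, an ANNS index in practice).
    Stage 2: exact MaxSim rerank on the k' candidates."""
    approx = W @ psi_query                     # <w_j, Psi(X)> for all j
    cand = np.argsort(-approx)[:k_prime]       # top-k' approximate candidates
    exact = [(j, maxsim(query_tokens, docs[j])) for j in cand]
    exact.sort(key=lambda t: -t[1])
    return [j for j, _ in exact[:k]]

rng = np.random.default_rng(2)
m, d, d_latent = 20, 8, 16
docs = [rng.normal(size=(int(rng.integers(3, 7)), d)) for _ in range(m)]
W = rng.normal(size=(m, d_latent))             # document rows w_j
X = rng.normal(size=(4, d))                    # query token embeddings
q = rng.normal(size=d_latent)                  # stand-in for Psi(X)
top = retrieve(X, q, W, docs, k=3, k_prime=10)
```

The design point is that only k′ exact MaxSim evaluations are paid per query instead of m, while the final ranking still uses the true multi‑vector score.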

Training data can be obtained without any labeled query set. The authors either use available query collections (when present) or, more generally, treat a random subset of documents as pseudo‑queries by encoding them with the query encoder Q of the underlying multi‑vector model. Experiments show that using actual queries yields a slight boost, but the method remains robust when only document‑derived pseudo‑queries are used, demonstrating that LEMUR does not rely on external supervision.

To scale training to massive corpora, LEMUR adopts a two‑stage training strategy. First, ψ is pre‑trained on a small sampled subset of documents (size m′ ≪ m) using the same regression objective. Then ψ is frozen, and each document weight w_j is obtained by solving a closed‑form ordinary least‑squares problem: w_j = arg min_β 𝔼_x[(⟨β, ψ(x)⟩ − g_j(x))²], where g_j(x) = max_{c∈C_j} ⟨c, x⟩ is the j‑th component of the regression target. Since each w_j depends only on document j's targets, these least‑squares problems are independent and can be solved for all documents in parallel.
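With ψ frozen, fitting one document row w_j is a standard least‑squares solve over the latent token features. A minimal numpy sketch with synthetic features and targets (real features would come from the frozen ψ, and real targets from MaxSim contributions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_latent = 200, 16

# psi(x) for n sampled training tokens (psi is frozen at this stage)
Psi = rng.normal(size=(n, d_latent))
# target MaxSim contributions g_j(x) for one document j (synthetic here)
g_j = rng.normal(size=n)

# Closed-form OLS: w_j = argmin_beta ||Psi beta - g_j||^2
w_j, *_ = np.linalg.lstsq(Psi, g_j, rcond=None)
```

Because each document's weight vector is fit independently against the same frozen feature matrix, the per‑document solves can be batched or parallelized trivially.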

