RARe: Retrieval Augmented Retrieval with In-Context Examples
While in-context learning is well-studied with decoder-only language models (LLMs), its utility for encoder-only models remains underexplored. We study in-context learning for encoder-only models on text retrieval tasks. Can incorporating in-context examples (query-document pairs) into the target query enhance retriever performance? Our approach, RARe, fine-tunes a pre-trained model with in-context examples whose queries are semantically similar to the target query. This approach achieves gains of up to +2.72% nDCG across open-domain retrieval datasets (BeIR, RAR-b) compared to using the target query alone as input. In particular, we find that RARe exhibits stronger out-of-domain generalization than models using queries without in-context examples, similar to what is seen with in-context learning in LLMs. We further analyze the design choices of in-context example augmentation for retrievers and lay the foundation for future work.
💡 Research Summary
The paper introduces RARe (Retrieval‑Augmented Retrieval with In‑Context Examples), a method that brings the benefits of in‑context learning (ICL) to encoder‑only dense retrieval models. While ICL has been extensively studied for decoder‑only large language models (LLMs), its applicability to embedding‑based retrievers has remained unclear. The authors ask whether prepending a set of query‑document pairs that are semantically similar to a target query can improve the query representation learned by a dense retriever.
Methodology
RARe builds on the standard dense retrieval pipeline, where a query q and a document d are encoded by a shared embedder E(·) into fixed-dimensional vectors and cosine similarity is used for ranking. The key innovation is to augment the input query with a task-specific instruction and a small set of in-context examples. For each target query, the system first retrieves the k most semantically similar queries from the training corpus using a sparse BM25 retriever. The positive documents of those retrieved queries then form the in-context example set D_RARe = {(q^i, d^{+,i})}_{i=1..k}. The final input string is constructed as:
Instruct: {task_instruction}; Query: {q^1}; Document: {d^{+,1}}; … ; Query: {q}
This “augmented query” (q_RARe) is then fed to the embedder. The model is fine‑tuned with the same contrastive loss used in conventional dense retrieval, but the loss is now computed on the embedding of q_RARe. By training on such augmented inputs, the embedder learns to incorporate the additional contextual information provided by the examples.
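The selection-and-formatting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are invented, and a toy token-overlap score stands in for the actual BM25 retriever.

```python
def score_overlap(query, candidate):
    """Toy lexical score standing in for BM25: count shared lowercase tokens."""
    return len(set(query.lower().split()) & set(candidate.lower().split()))

def select_examples(query, train_pairs, k=5):
    """Pick the k training (query, positive document) pairs whose query
    is most lexically similar to the target query."""
    ranked = sorted(train_pairs, key=lambda p: score_overlap(query, p[0]),
                    reverse=True)
    return ranked[:k]

def build_rare_input(task_instruction, query, examples):
    """Format the augmented query: instruction, k in-context pairs,
    then the target query last."""
    parts = [f"Instruct: {task_instruction}"]
    for q_i, d_i in examples:
        parts.append(f"Query: {q_i}; Document: {d_i}")
    parts.append(f"Query: {query}")
    return "; ".join(parts)
```

The resulting string is what gets embedded in place of the bare query; during training the same construction is applied to every training instance.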
Experimental Setup
The authors evaluate two training regimes: (1) fine‑tuning from a decoder‑only LLM checkpoint (Llama‑3.1‑8B‑Instruct) and (2) fine‑tuning from existing dense retriever checkpoints (LLM2Vec‑Llama‑3‑8B‑Supervised and E5‑Mistral‑Instruct). Training data consists of a publicly released portion of the E5 dataset and the MS‑MARCO passage ranking set. For each training instance, five in‑context examples are drawn from the same dataset, selected by BM25 similarity.
Evaluation is performed on the BeIR benchmark (covering a wide range of domains) and a reasoning‑oriented subset of the RAR‑b benchmark (including HellaSwag, PIQA, ARC‑C, etc.). Performance is measured with nDCG@10, and datasets are split into in‑domain (seen during training) and out‑of‑domain (unseen) groups to assess generalization.
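For reference, nDCG@10 can be computed as in this minimal sketch (binary relevance labels are assumed for simplicity; the function names are illustrative):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal ranking's DCG."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfect ranking scores 1.0; pushing relevant documents further down the list is penalized logarithmically.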
Key Findings
- Inference‑only augmentation fails – Simply adding in‑context examples to a pre‑trained retriever at test time degrades performance across all three baseline models (SFR‑Embedding‑2‑R, LLM2Vec‑Llama‑3‑8B‑Supervised, E5‑Mistral‑7B‑Instruct). This confirms that encoder‑only models do not automatically benefit from few‑shot examples the way decoder‑only LLMs do.
- Training with RARe improves over strong baselines – When the model is fine‑tuned with the augmented query format, RARe consistently outperforms two strong baselines: RepLLaMA (plain query) and PromptRetriever (synthetic instruction‑augmented queries). On the reasoning‑heavy RAR‑b benchmark, RARe achieves a +2.72% absolute gain in nDCG@10 over PromptRetriever and +0.87% over a random‑example ICL baseline.
- Out‑of‑domain generalization – Gains are especially pronounced on out‑of‑domain datasets. For example, on the BeIR OOD split, RARe improves nDCG@10 by roughly 1–2% compared to the same base model without examples, indicating that semantically similar examples act as a form of task‑specific knowledge that transfers across domains.
- Ablation on example quality and quantity – Experiments varying the number of examples (k) show that 3–7 examples work best; larger k leads to input‑length overflow and hurts performance. Selecting examples by BM25 similarity (i.e., semantic relevance) yields significantly higher gains than random selection, confirming that the relevance of the examples is crucial.
- Choice of retrieval method for example selection – The authors use BM25 for its speed and simplicity, but note that a dense retriever could replace it without harming performance, suggesting flexibility in the example‑selection pipeline.
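The input-length overflow noted in the ablation suggests a simple guard: drop the least-similar examples until the formatted input fits a token budget. The sketch below is an assumption on my part, not from the paper; the function name, whitespace tokenization (a rough stand-in for the model's tokenizer), and budget value are all illustrative.

```python
def fit_examples_to_budget(task_instruction, query, examples, max_tokens=512):
    """Drop lowest-ranked examples until the formatted input fits the budget.

    `examples` is assumed sorted best-first, so the least similar
    example is discarded first.
    """
    def n_tokens(text):
        # Whitespace tokens approximate the real tokenizer's count.
        return len(text.split())

    def render(exs):
        parts = [f"Instruct: {task_instruction}"]
        parts += [f"Query: {q}; Document: {d}" for q, d in exs]
        parts.append(f"Query: {query}")
        return "; ".join(parts)

    kept = list(examples)
    while kept and n_tokens(render(kept)) > max_tokens:
        kept.pop()  # discard the least similar example first
    return kept, render(kept)
```

Because the target query is always appended last, truncating examples never removes the query itself.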
Analysis and Insights
RARe demonstrates that encoder‑only models can indeed benefit from few‑shot style context, provided the model is trained to interpret that context. The semantic similarity criterion ensures that the examples are tightly aligned with the target query’s intent, effectively acting as a soft prompt that guides the embedder toward a more discriminative representation. This mirrors the way ICL works in LLMs, but the mechanism differs: instead of expanding the model’s generative capacity, the examples supply additional task‑relevant signal that the contrastive loss can exploit.
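The contrastive objective that exploits this signal is the standard InfoNCE loss, computed on the augmented-query embedding against one positive and several negative documents. A pure-Python sketch (toy vectors and an illustrative temperature; the paper's exact formulation may differ in batching and negative sampling):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(q_emb, pos_emb, neg_embs, temperature=0.05):
    """InfoNCE: negative log-softmax of the positive document's similarity
    against all candidate documents."""
    sims = [cosine(q_emb, pos_emb)] + [cosine(q_emb, n) for n in neg_embs]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_sum)
```

The loss is near zero when the augmented query sits close to its positive document and far from the negatives, which is exactly the geometry the in-context examples are meant to encourage.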
The work also highlights practical considerations: (a) the need to fine‑tune the retriever with the augmented format (in‑context examples cannot be added post‑hoc), (b) the trade‑off between the number of examples and model input limits, and (c) the computational overhead of retrieving examples at both training and inference time.
Limitations and Future Directions
The primary limitation is the extra retrieval step required to fetch in‑context examples, which adds latency. Moreover, the method relies on the existence of a sufficiently large corpus of labeled query‑document pairs to find meaningful examples; low‑resource domains may struggle. The authors suggest exploring learned dense retrieval for example selection, dynamic k‑selection based on query difficulty, and extending the approach to multimodal retrieval as promising avenues.
Conclusion
RARe provides a simple yet effective recipe for injecting few‑shot, example‑based knowledge into encoder‑only dense retrievers. By prepending semantically similar query‑document pairs and fine‑tuning with the standard contrastive objective, the method yields consistent improvements across standard and reasoning‑heavy benchmarks, with especially strong out‑of‑domain generalization. The authors release code and model checkpoints, paving the way for broader adoption and further research into in‑context learning for retrieval‑oriented architectures.