Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias
Retrieval-Augmented Generation (RAG) enhances language model performance by incorporating external knowledge retrieved from large corpora, which makes it highly suitable for tasks such as open-domain question answering. Standard RAG systems typically rely on a fixed top-k retrieval strategy, which can either miss relevant information or introduce semantically irrelevant passages, known as distractors, that degrade output quality. Additionally, the positioning of retrieved passages within the input context can influence the model's attention and generation outcomes. Context placed in the middle tends to be overlooked, an issue known as the "lost in the middle" phenomenon. In this work, we systematically analyze the impact of distractors on generation quality and quantify their effects under varying conditions. We also investigate how the position of relevant passages within the context window affects their influence on generation. Building on these insights, we propose a context-size classifier that dynamically predicts the optimal number of documents to retrieve based on query-specific informational needs. We integrate this approach into a full RAG pipeline and demonstrate improved performance over fixed-k baselines.
💡 Research Summary
Retrieval‑Augmented Generation (RAG) has become a popular paradigm for enhancing large language models (LLMs) with external knowledge, especially in open‑domain question answering and multi‑hop reasoning tasks. Traditional RAG pipelines retrieve a fixed number of top‑k passages for every query and concatenate them with the query before feeding the combined context to a generative model. While straightforward, this fixed‑k approach suffers from two well‑documented problems: (1) the inclusion of semantically irrelevant “distractor” passages that dilute useful information and can mislead the generator, and (2) a positional bias in LLMs, often called the “lost in the middle” phenomenon, where information placed in the middle of the input receives less attention than content at the beginning or end.
The authors first conduct a systematic empirical study on the MuSiQue‑Ans benchmark, which contains 2‑hop, 3‑hop, and 4‑hop questions with gold supporting passages. By varying the number of gold passages, adding distractors, and changing the placement of relevant passages (beginning, middle, end), they quantify the impact on generation quality measured by Exact Match (EM) and F1. Results show that a single distractor can drop EM by more than 26 % for 2‑hop questions, while 3‑hop and 4‑hop questions suffer smaller yet still significant drops (≈13 %–14 %). Positional experiments reveal that placing relevant passages at the end yields the highest scores, whereas middle placement leads to the worst performance, confirming the known extremity bias of LLMs.
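The experimental setup above can be sketched in a few lines. The paper does not publish its exact evaluation code, so the snippet below uses the common SQuAD-style answer normalization for EM and token-level F1, plus a hypothetical `place_gold` helper that orders gold and distractor passages for the positional experiments; the function name and the middle-split strategy are illustrative assumptions.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """EM is 1 iff the normalized strings are identical."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def place_gold(gold: list, distractors: list, position: str) -> list:
    """Order passages so the gold ones sit at the 'beginning', 'middle', or 'end'."""
    if position == "beginning":
        return gold + distractors
    if position == "end":
        return distractors + gold
    half = len(distractors) // 2  # split distractors around the gold passages
    return distractors[:half] + gold + distractors[half:]
```

A run of the positional study would then loop over `position in ("beginning", "middle", "end")`, build each context with `place_gold`, and average `exact_match`/`f1_score` over the benchmark questions.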
To address these issues, the paper proposes a dynamic‑k framework. A query‑specific “context‑size classifier” predicts the optimal number of passages k to retrieve for each query. The classifier is built on RoBERTa‑base and trained as a multi‑class classifier whose target label is the hop count (2, 3, or 4), which then serves as the predicted k. The model achieves 87.3 % accuracy across three datasets (MuSiQue‑Ans, 2WikiMultihopQA, MultihopRAG) and comparable performance when trained on a single dataset, indicating good generalization.
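The mapping from the classifier's output to a context size can be sketched as follows. The hop-label table mirrors the paper's three classes; the softmax-confidence fallback to the baseline's fixed k = 5 is an illustrative assumption, not part of the paper.

```python
import math

# Class index -> hop count; in this setup the predicted hop count doubles as k.
HOP_LABELS = {0: 2, 1: 3, 2: 4}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_k(logits, fallback_k=5, min_confidence=0.5):
    """Map classifier logits to a context size k.

    Falls back to a fixed k (the baseline's k = 5) when the top class is
    below the confidence threshold -- that threshold is an assumption
    added here for illustration, not described in the paper.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return fallback_k
    return HOP_LABELS[best]
```

In a real pipeline `logits` would come from a fine-tuned RoBERTa-base sequence-classification head run on the query text.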
The dynamic‑k pipeline works as follows: (1) The classifier processes the query and outputs k_pred. (2) A retriever (BM25, dense vector search, or a two‑stage ColBERT + MonoT5/BGE‑Reranker) fetches a larger pool of candidate passages (e.g., top‑50). (3) An LLM reranker (Mistral Nemo Instruct, 12.2 B parameters) receives the query, the predicted k, and the candidate pool, and is prompted to select exactly the k_pred most relevant passages. (4) The selected passages are concatenated with the query and fed to Flan‑T5‑XL for final answer generation.
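The four steps above can be wired together as a small orchestration function. The component names and signatures below are assumptions for illustration; in practice `predict_k` would wrap the RoBERTa classifier, `retrieve` a BM25/dense back-end, `rerank` the prompted Mistral Nemo Instruct selector, and `generate` Flan-T5-XL.

```python
from typing import Callable, List

def dynamic_k_pipeline(
    query: str,
    predict_k: Callable[[str], int],                      # context-size classifier
    retrieve: Callable[[str, int], List[str]],            # retriever over the corpus
    rerank: Callable[[str, List[str], int], List[str]],   # LLM reranker: keep exactly k passages
    generate: Callable[[str, List[str]], str],            # answer generator
    pool_size: int = 50,                                  # candidate pool, e.g. top-50
) -> str:
    k = predict_k(query)                     # (1) predict query-specific k
    candidates = retrieve(query, pool_size)  # (2) over-retrieve a candidate pool
    passages = rerank(query, candidates, k)  # (3) select exactly k_pred passages
    return generate(query, passages)         # (4) generate from query + selected context
```

Because each stage is passed in as a callable, retrieval back-ends or rerankers can be swapped without touching the pipeline itself, which matches the paper's comparison across BM25, dense, and ColBERT-based configurations.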
Extensive experiments compare three configurations: (a) a baseline fixed‑k (k = 5) system, (b) a classifier‑k system without reranking, and (c) the full classifier‑k + LLM reranker system. Across all retrieval back‑ends, the full system consistently outperforms the baseline. On the MuSiQue‑Ans development set, EM improves from 0.58 (baseline) to 0.66 (full system) and F1 from 0.62 to 0.71. The gains are especially pronounced for simpler 2‑hop queries, where the dynamic selection reduces exposure to distractors and places the most relevant passages toward the end of the context, mitigating the “lost in the middle” effect.
The paper’s contributions are threefold: (1) a quantitative analysis of how distractor ratio and passage position affect RAG generation quality, (2) the introduction of a lightweight, query‑aware classifier that predicts the optimal number of passages to retrieve, and (3) integration of this classifier with an LLM‑based reranker to form a flexible, adaptive RAG pipeline. The results demonstrate that dynamically adjusting context size and ordering can substantially improve answer correctness without requiring major changes to the underlying retrieval or generation models. This work thus provides a practical roadmap for building more robust RAG systems, particularly in settings that involve multi‑hop reasoning, heterogeneous query complexity, and limited input windows.