Rethinking On-policy Optimization for Query Augmentation
Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model’s parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. Each approach has its own advantages and limitations, yet the two have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which the LLM policy, rather than rewriting the query, learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show that OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
💡 Research Summary
The paper tackles the problem of query augmentation for information retrieval (IR) by systematically comparing two dominant paradigms that have emerged with the rise of large language models (LLMs): (1) prompting‑based pseudo‑document generation, a zero‑ or few‑shot approach that expands the original query with a synthetic document produced by the LLM, and (2) reinforcement‑learning (RL)‑based query rewriting, where the LLM is fine‑tuned as a policy that directly optimizes retrieval metrics such as Recall@K or NDCG. Although both methods have been explored separately, no prior work has evaluated them side‑by‑side under identical experimental conditions.
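The retrieval metrics mentioned above have simple closed forms. As a minimal sketch (binary relevance assumed; function names are ours, not from the paper), Recall@K is the fraction of relevant documents retrieved in the top K, and NDCG@K discounts hits logarithmically by rank and normalizes by the ideal ranking:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking over the ideal DCG."""
    rel = set(relevant_ids)
    dcg = sum(1.0 / math.log2(i + 2)          # rank i (0-based) contributes 1/log2(i+2)
              for i, d in enumerate(ranked_ids[:k]) if d in rel)
    ideal = sum(1.0 / math.log2(i + 2)        # best case: all relevant docs ranked first
                for i in range(min(len(rel), k)))
    return dcg / ideal
```

For example, if only the second-ranked document is relevant, Recall@2 is 0.5 and NDCG@2 is 1/log2(3) divided by the ideal DCG.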
Methodology
The authors replicate the DeepRetrieval framework (Jiang et al., 2025), which uses on‑policy Proximal Policy Optimization (PPO) with KL‑regularization to train a query‑rewriting policy. They also introduce a simple prompting baseline called SPQE (Simple Pseudo‑document Query Expansion), which instructs an instruction‑following LLM to generate a hypothetical answer document d_H for a query q and then concatenates (q, d_H) as the augmented query. Both methods are evaluated on the same LLM back‑ends (3‑billion‑parameter and 7‑billion‑parameter models), the same retrieval engines (BM25 for sparse retrieval and several dense retrievers), and the same corpora and query sets. The datasets span evidence‑seeking tasks (NQ, TriviaQA, SQuAD) and ad‑hoc retrieval tasks from the BEIR benchmark (FEVER, HotpotQA, etc.).
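The SPQE baseline described above can be sketched in a few lines. This is our illustrative reading, not the authors' exact code: `llm_generate` is a placeholder for any instruction-following LLM call, and the prompt wording is an assumption.

```python
def spqe_augment(query, llm_generate):
    """SPQE sketch: prompt an LLM for a hypothetical answer document d_H,
    then concatenate (q, d_H) to form the augmented query.

    `llm_generate` stands in for any instruction-following LLM backend.
    """
    prompt = (
        "Write a short passage that answers the following question.\n"
        f"Question: {query}\nPassage:"
    )
    pseudo_doc = llm_generate(prompt)
    # The augmented query is simply the original query followed by d_H.
    return f"{query} {pseudo_doc}"
```

With a sparse retriever such as BM25, the extra terms in the pseudo-document directly increase lexical overlap with relevant passages, which is consistent with the sparse-retrieval gains reported below.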
Key Findings
- Performance vs. Cost – SPQE, despite requiring no training, matches or exceeds the RL‑based DeepRetrieval policy on most benchmarks. The advantage is especially pronounced for sparse retrievers where term‑level expansion directly improves matching.
- RL Sensitivity – The RL policy yields modest gains only in certain scenarios, notably when paired with dense retrievers on ad‑hoc tasks. Its performance is highly sensitive to reward design, and training incurs substantial computational overhead (millions of rollouts and repeated PPO updates).
- Task Dependence – Gains from RL are inconsistent across task types; evidence‑seeking queries see limited improvement, while ad‑hoc queries sometimes benefit, but only when the reward function is carefully engineered.
Hybrid Proposal – OPQE
Motivated by these observations, the authors propose On‑policy Pseudo‑document Query Expansion (OPQE). Instead of training the policy to output a rewritten query directly, OPQE trains the policy to generate a pseudo‑document that maximizes a retrieval‑based reward. The generated pseudo‑document is then concatenated with the original query, preserving the generative flexibility of prompting while allowing gradient‑based optimization toward the retrieval objective. OPQE consistently outperforms both SPQE and DeepRetrieval across all evaluated benchmarks, delivering an average absolute improvement of 2–4 percentage points in Recall@10/NDCG and reducing training cost by roughly 30 % compared to the vanilla RL baseline.
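The reward that drives OPQE's on-policy training can be sketched as follows. This is a minimal illustration under our assumptions (recall-style reward, hypothetical `retrieve` callable); the paper's actual reward and PPO machinery, including the KL regularizer, are not reproduced here.

```python
def opqe_reward(query, pseudo_doc, relevant_ids, retrieve, k=10):
    """Reward sketch for one OPQE rollout: retrieve with the concatenated
    (query + pseudo-document) and score recall of known relevant docs in
    the top-k results.

    `retrieve` is a placeholder for any retriever (e.g. BM25 or a dense
    index) that returns a ranked list of document ids.
    """
    augmented = f"{query} {pseudo_doc}"          # same concatenation as SPQE
    ranked = retrieve(augmented)
    hits = sum(1 for d in ranked[:k] if d in set(relevant_ids))
    return hits / len(relevant_ids)              # scalar reward for PPO
```

The key design point is that the policy's action space is a pseudo-document rather than a rewritten query, so gradient updates shape the same generative format that makes prompting effective.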
Contributions and Impact
- A rigorously controlled empirical comparison of prompting‑based and RL‑based query augmentation, filling a gap in the literature.
- Evidence that training‑free, prompt‑driven pseudo‑document generation is a strong, resource‑efficient baseline for modern IR pipelines.
- Introduction of OPQE, a novel hybrid that leverages on‑policy optimization for pseudo‑document generation, achieving state‑of‑the‑art performance while being more compute‑efficient.
The authors release their code and data to facilitate reproducibility and suggest future directions such as automated reward shaping, multimodal LLM integration, and deployment with real‑world API‑based retrieval services. Overall, the work demonstrates that sophisticated RL training is not always necessary for effective query augmentation and that a synergistic combination of prompting and policy optimization can yield superior results.