CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, then resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference the LLM is invoked only twice in total (once for sub-question decomposition and once for final answer synthesis), regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available on GitHub.


💡 Research Summary

CompactRAG addresses the inefficiencies of existing multi‑hop retrieval‑augmented generation (RAG) systems, which repeatedly invoke large language models (LLMs) at each reasoning hop, leading to high token consumption, latency, and instability in entity grounding. The proposed framework separates corpus preprocessing from online inference. In a one‑time offline stage, an LLM reads the entire corpus and rewrites each document into a set of atomic question‑answer (QA) pairs. Each pair captures a single factual statement, includes explicitly annotated entities, and is stored as a concatenated “question;answer” text. These pairs are then embedded with a dense retriever, creating a compact, semantically aligned knowledge base.
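The offline stage can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: the hand-written QA pairs, the `embed` function (a hashed bag-of-words stand-in for a learned dense encoder), and the cosine-similarity `retrieve` helper are all assumptions made for demonstration.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy stand-in for a dense retriever's encoder: hashed bag-of-words,
    L2-normalized. A real system would use a learned sentence encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Offline stage: each document is rewritten into atomic QA pairs
# (hand-written here for illustration), stored as "question;answer" text.
atomic_qa = [
    "Who directed Inception?;Christopher Nolan",
    "When was Inception released?;2010",
    "Where was Christopher Nolan born?;London",
]
knowledge_base = [(qa, embed(qa)) for qa in atomic_qa]

def retrieve(query, k=2):
    """Dense retrieval: rank stored QA pairs by cosine similarity."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, v)), qa)
              for qa, v in knowledge_base]
    return [qa for _, qa in sorted(scored, reverse=True)[:k]]

print(retrieve("Who directed Inception?", k=1))
```

Because each stored unit is a single atomic fact phrased as a question, sub-questions at inference time are semantically close to the stored text, which is what makes plain dense retrieval effective here.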

During online inference, a user query is first decomposed into a dependency-ordered graph of sub-questions using the LLM (first call). For each sub-question, dense retrieval returns the top-k relevant QA pairs, from which a lightweight RoBERTa-based Answer Extractor predicts the answer span, while a Sub-Question Rewriter updates subsequent sub-questions with the newly obtained answers, preventing entity drift. No LLM is involved in these steps. After all sub-questions are resolved, a second LLM call synthesizes the final answer from the collected evidence. Consequently, the number of LLM invocations per query is fixed at two, regardless of hop depth, and token usage grows only with the size of the retrieved QA snippets, not with the number of reasoning steps.
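A minimal sketch of this online loop, with both LLM calls replaced by stubs so the control flow and call count are visible. The `#1` placeholder syntax for referencing earlier answers, and the `extract_answer` lookup standing in for retrieval plus RoBERTa span extraction, are illustrative assumptions, not the paper's exact formats.

```python
llm_calls = 0  # count how often the (stubbed) LLM is invoked

def llm_decompose(question):
    """LLM call 1: decompose into dependency-ordered sub-questions."""
    global llm_calls
    llm_calls += 1
    return [
        "Who directed Inception?",
        "Where was #1 born?",  # "#1" = answer of sub-question 1 (assumed syntax)
    ]

def extract_answer(sub_question):
    """Stand-in for dense retrieval + RoBERTa span extraction (no LLM)."""
    facts = {
        "Who directed Inception?": "Christopher Nolan",
        "Where was Christopher Nolan born?": "London",
    }
    return facts[sub_question]

def llm_synthesize(question, evidence):
    """LLM call 2: synthesize the final answer from collected evidence."""
    global llm_calls
    llm_calls += 1
    return evidence[-1][1]

def answer(question):
    subs = llm_decompose(question)
    evidence = []
    for sub in subs:
        # Sub-Question Rewriter: substitute earlier answers into later
        # sub-questions so entities stay grounded before retrieval.
        for i, (_, ans) in enumerate(evidence, start=1):
            sub = sub.replace(f"#{i}", ans)
        evidence.append((sub, extract_answer(sub)))
    return llm_synthesize(question, evidence)

print(answer("Where was the director of Inception born?"))  # London
print(llm_calls)  # 2, no matter how many hops the loop runs
```

Note that the loop body never touches the LLM: adding more hops only adds retrieval and extraction work, which is the source of the fixed two-call budget.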

Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG attains accuracy comparable to strong iterative baselines (within 1‑2 % absolute) while reducing average token consumption by over 40 % and cutting inference latency by roughly 35 %. The token savings become more pronounced as the number of hops increases, confirming the scalability of the design.

Key contributions include: (1) a quantitative analysis of how token cost and LLM calls scale with hop depth in traditional RAG pipelines; (2) the introduction of an offline atomic QA knowledge base that eliminates redundancy and aligns closely with query semantics; (3) a two‑call online architecture that decouples retrieval from reasoning, achieving both efficiency and stable entity grounding; and (4) empirical validation of competitive performance with substantial cost reductions.
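Contribution (1), the call-count scaling, can be made concrete with a back-of-the-envelope model. The cost formulas below are a simplified assumption for illustration (one retrieve-then-reason LLM call per hop plus a synthesis call for the iterative baseline), not figures from the paper.

```python
def iterative_rag_calls(hops):
    # Assumed baseline: one LLM call per reasoning hop + one final synthesis.
    return hops + 1

def compactrag_calls(hops):
    # CompactRAG: decomposition + synthesis, independent of hop depth.
    return 2

for n in (2, 3, 4):
    print(f"{n} hops: iterative={iterative_rag_calls(n)}, "
          f"compactrag={compactrag_calls(n)}")
```

Under this model the gap widens linearly with hop depth, which matches the summary's observation that token savings grow as the number of hops increases.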

Limitations involve the upfront cost of generating high‑quality QA pairs, which depends on a powerful LLM and careful prompting, and the potential for overly fine‑grained QA units to increase retrieval candidate sets. Future work may explore automatic quality assessment of generated QA pairs, dynamic candidate filtering, and extensions to multimodal or web‑scale corpora. Overall, CompactRAG offers a practical, cost‑effective solution for multi‑hop question answering at scale.

