Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering


Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.

💡 Research Summary

This paper introduces PMSR (Progressive Multimodal Search and Reasoning), a novel framework designed to address the challenges of knowledge-intensive Visual Question Answering (VQA). Traditional multimodal Retrieval-Augmented Generation (RAG) systems typically follow a single-pass “retrieve-then-read” paradigm, which often fails to gather sufficient knowledge and lacks mechanisms to correct misguided reasoning. Furthermore, emerging multimodal agentic approaches condition each step on the entire, unstructured interaction history, leading to error propagation and reasoning “drift” over iterations.

PMSR proposes a fundamentally different strategy: the progressive construction of a structured reasoning trajectory. The framework operates in three core stages: initial record generation, iterative reasoning trajectory updates, and adaptive termination. It begins by generating an initial reasoning record using an MLLM, bootstrapping the trajectory. The core iterative loop then drives knowledge acquisition. At each step, PMSR formulates dual-scope queries: a record-level query conditioned on the latest reasoning record for local refinement, and a trajectory-level query derived from the entire accumulated trajectory to preserve broader intent. These queries are used to perform a joint search across heterogeneous knowledge bases—a textual KB and a multimodal (image-text) KB—retrieving diverse and complementary evidence.
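The dual-scope querying and joint search step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Evidence`, `dual_scope_queries`, `joint_search`, and `keyword_retriever` are hypothetical, the queries are built from simple string templates rather than generated by an MLLM, and the two knowledge bases are stood in for by a toy keyword-overlap retriever.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Evidence:
    source: str   # which KB the item came from, e.g. "text" or "multimodal"
    content: str  # the retrieved passage or caption


# A retriever maps a query string to a ranked list of evidence (assumed interface).
Retriever = Callable[[str], List[Evidence]]


def dual_scope_queries(trajectory: List[str], question: str) -> List[str]:
    """Build one record-level and one trajectory-level query.

    Assumption: plain string templates stand in for model-generated queries.
    """
    record_query = f"{question} | refine: {trajectory[-1]}"
    trajectory_query = f"{question} | context: {' ; '.join(trajectory)}"
    return [record_query, trajectory_query]


def joint_search(
    queries: List[str], retrievers: List[Retriever], top_k: int = 3
) -> List[Evidence]:
    """Send every query to every KB and merge the results without duplicates."""
    seen = set()
    merged: List[Evidence] = []
    for query in queries:
        for retrieve in retrievers:
            for item in retrieve(query)[:top_k]:
                if item not in seen:
                    seen.add(item)
                    merged.append(item)
    return merged


def keyword_retriever(source: str, corpus: List[str]) -> Retriever:
    """Toy retriever that ranks documents by word overlap with the query."""
    def retrieve(query: str) -> List[Evidence]:
        query_tokens = set(query.lower().split())
        scored = [
            (len(query_tokens & set(doc.lower().split())), doc) for doc in corpus
        ]
        # Keep only documents sharing at least one token, best match first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [Evidence(source, doc) for _, doc in ranked]
    return retrieve
```

To use it, wrap each knowledge base in a retriever, call `dual_scope_queries` on the current trajectory, and pass the resulting queries and both retrievers to `joint_search`. Deduplicating across queries and sources keeps the candidate pool compact before synthesis.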

The newly retrieved evidence candidates are not simply accumulated. Instead, they are fed into a compositional reasoning module that synthesizes them into a new, compact reasoning record. This record is appended to the trajectory, updating the reasoning state in a structured and refined manner. Crucially, each new record is generated solely from the fresh evidence, preventing direct contamination from earlier potential errors. The process terminates adaptively based on query saturation, signaling that further iterations yield diminishing returns. The final answer is generated by conditioning the MLLM on the complete, structured reasoning trajectory.
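The overall loop (bootstrap a record, search, synthesize a fresh record, stop on saturation) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: `progressive_reasoning` and its callable parameters are hypothetical names, and "query saturation" is approximated here by the crude rule that an iteration produces no query that has not already been issued. The criterion actually used by PMSR is not specified in this summary.

```python
from typing import Callable, List


def progressive_reasoning(
    question: str,
    init_record: Callable[[str], str],
    make_queries: Callable[[List[str], str], List[str]],
    search: Callable[[List[str]], List[str]],
    synthesize: Callable[[str, List[str]], str],
    max_steps: int = 5,
) -> List[str]:
    """Build a reasoning trajectory as a list of compact records.

    All callables are placeholders for model or retriever calls (assumed interfaces).
    """
    # Stage 1: bootstrap the trajectory with an initial record.
    trajectory = [init_record(question)]
    issued = set()

    # Stage 2: iterative trajectory updates.
    for _ in range(max_steps):
        queries = make_queries(trajectory, question)
        new_queries = [q for q in queries if q not in issued]

        # Stage 3: adaptive termination. Assumed saturation proxy:
        # stop when the step proposes nothing that was not already asked.
        if not new_queries:
            break
        issued.update(new_queries)

        evidence = search(new_queries)
        # The new record is synthesized from the question and the fresh
        # evidence only, not from earlier records, to limit error carry-over.
        trajectory.append(synthesize(question, evidence))

    return trajectory
```

The returned trajectory would then be passed, together with the image and question, to the answer-generating model. Note that `synthesize` deliberately receives no earlier records, mirroring the description that each record is built solely from newly retrieved evidence, while `make_queries` still sees the whole trajectory.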

Extensive experiments across six diverse and challenging benchmarks—Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA—demonstrate PMSR’s superior performance. It consistently outperforms strong multimodal RAG and agentic baselines in both retrieval recall and end-to-end answer accuracy, achieving state-of-the-art results on five benchmarks. Ablation studies confirm the synergistic contribution of its key components: dual-scope querying, joint heterogeneous KB search, and compositional record synthesis. Trajectory analysis further reveals that PMSR more effectively corrects early mistakes and maintains stable reasoning paths with reduced drift compared to history-dependent agentic methods.

In summary, PMSR establishes a new paradigm for knowledge-intensive multimodal reasoning. By replacing unstructured history accumulation with a progressive, structured trajectory built through controlled search and synthesis cycles, it enables more robust knowledge acquisition, coherent evidence integration, and reliable multi-step reasoning.
