Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology

Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.


💡 Research Summary

This paper investigates the trade‑off between two paradigms for multimodal retrieval‑augmented generation (MM‑RAG) in the highly visual domain of glycobiology. The authors built a private benchmark consisting of 120 multiple‑choice questions (MCQs) drawn from 25 glycobiology papers. Questions are stratified into three difficulty levels: “easy” (answer appears directly in the text), “medium” (answer is contained only in tables or figures), and “hard” (answer requires integrating information across text, figures, supplementary tables, or cited references).

Four augmentation strategies are evaluated: (1) None – the LLM receives only the query; (2) Text RAG – standard text‑only retrieval using Docling‑extracted text and OCR, with embeddings from BGE‑base‑en‑v1.5 stored in Qdrant; (3) Multimodal conversion – the same pipeline plus automatic summarization of tables and figures into textual descriptions; (4) Vision‑based retrieval – late‑interaction visual retrievers that index whole page images and return the most similar pages directly to the LLM. The visual retrievers examined are ColPali (based on PaliGemma‑3B), ColQwen (Qwen2‑VL‑2B backbone with adapters), and ColFlor (a lightweight Florence‑2/DAViT‑based model).
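To make the contrast concrete, here is a minimal sketch of strategy (2), Text RAG, under illustrative assumptions: Docling-extracted text chunks are embedded with BGE-base-en-v1.5 and indexed in Qdrant, and the top-k most similar chunks are retrieved for each question. The collection name, chunking, and top-k are placeholders, not the paper's exact configuration.

```python
# Hypothetical Text RAG sketch: embed text chunks with BGE-base-en-v1.5 and
# index/search them in Qdrant. Chunking, collection name, and top_k are
# illustrative assumptions, not the paper's exact settings.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
client = QdrantClient(":memory:")  # in-memory index for illustration


def index_chunks(chunks: list[str], collection: str = "glyco_text") -> None:
    """Embed text chunks and store them in a Qdrant collection."""
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    client.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
    )
    client.upsert(
        collection_name=collection,
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
            for i, (vec, chunk) in enumerate(zip(vectors, chunks))
        ],
    )


def retrieve(question: str, collection: str = "glyco_text", top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query = embedder.encode([question], normalize_embeddings=True)[0]
    hits = client.search(collection_name=collection, query_vector=query.tolist(), limit=top_k)
    return [hit.payload["text"] for hit in hits]
```

The vision-based retrievers in strategy (4) instead score a question against whole page images via late-interaction (MaxSim) matching over multi-vector embeddings; a minimal sketch of that scoring rule, with illustrative shapes, follows.

```python
# Late-interaction (MaxSim) scoring as used by ColPali-style retrievers:
# each query-token embedding is matched to its best page-patch embedding
# and the per-token maxima are summed. Embeddings assumed L2-normalized.
import numpy as np


def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """query_emb: (n_query_tokens, dim); page_emb: (n_patches, dim)."""
    sim = query_emb @ page_emb.T          # (n_query_tokens, n_patches)
    return float(sim.max(axis=1).sum())   # sum of per-token best matches
```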

Two experimental axes are presented. First, the authors compare an open‑source mid‑size model (Gemma‑3‑27B‑IT) with the proprietary GPT‑4o family across all four augmentations. Each configuration is run five times with answer‑order permutations, and accuracy is reported with Agresti‑Coull 95 % confidence intervals. Gemma‑3‑27B‑IT achieves markedly higher accuracy with Text and Multimodal augmentations (0.722–0.740) than with OCR‑free visual retrieval (≈0.51), indicating that a 27‑billion‑parameter model benefits from reducing the visual reasoning burden.
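For reference, the Agresti-Coull interval has a simple closed form; the sketch below shows one way to compute it from aggregate counts (pooling answers across the five permuted runs is an illustrative assumption, not necessarily the paper's aggregation).

```python
# Agresti-Coull 95% interval for a binomial accuracy estimate (sketch).
# n_correct / n_total could, for example, pool answers across the five runs.
import math


def agresti_coull_ci(n_correct: int, n_total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Agresti-Coull confidence interval for a proportion."""
    n_adj = n_total + z ** 2
    p_adj = (n_correct + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Example: 94 correct answers out of 120 questions -> roughly (0.70, 0.85)
low, high = agresti_coull_ci(94, 120)
```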

Second, the authors evaluate the GPT‑5 family (standard, mini, nano) combined with the three visual retrievers. Here, the gap narrows: ColPali and ColFlor each reach a peak accuracy of 0.828, while ColQwen is statistically indistinguishable. Notably, ColFlor, despite having only ~174 M parameters (≈17× smaller than ColPali), matches its accuracy while offering 5–10× faster encoding and lower memory consumption, making it a cost‑effective default when strong generators are available. GPT‑5‑nano trails larger variants by 8–10 %, underscoring the importance of model capacity for visual‑text integration.

Statistical analysis uses paired Wilcoxon signed‑rank tests with Bonferroni correction to assess within‑model differences across augmentations. Additional metrics (precision@5, cost per run, latency) confirm that visual retrieval can be competitive but may incur higher computational cost unless a lightweight retriever like ColFlor is used.
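A minimal sketch of that test procedure, using SciPy and hypothetical per-run accuracies (the numbers below are placeholders, not the paper's results), looks like this:

```python
# Paired Wilcoxon signed-rank tests across augmentations with a Bonferroni
# correction (sketch). The per-run accuracies are hypothetical placeholders.
from itertools import combinations
from scipy.stats import wilcoxon

runs = {  # accuracy in each of the 5 runs, per augmentation (illustrative)
    "text":       [0.731, 0.722, 0.745, 0.718, 0.734],
    "multimodal": [0.742, 0.751, 0.733, 0.746, 0.738],
    "colpali":    [0.503, 0.521, 0.512, 0.498, 0.516],
}

pairs = list(combinations(runs, 2))
corrected_alpha = 0.05 / len(pairs)  # Bonferroni-corrected threshold
for a, b in pairs:
    stat, p_value = wilcoxon(runs[a], runs[b])
    verdict = "significant" if p_value < corrected_alpha else "not significant"
    print(f"{a} vs {b}: p = {p_value:.4f} ({verdict} at alpha = {corrected_alpha:.4f})")
```

Note that with only five paired runs, the exact two-sided Wilcoxon p-value cannot fall below 0.0625, so the test is conservative by design.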

The key conclusions are: (1) Pipeline choice is capacity‑dependent. Mid‑size models achieve the best performance when visual content is converted to text, thereby lowering the downstream reasoning load. (2) Frontier‑scale models (GPT‑4o, GPT‑5) can leverage OCR‑free visual retrieval effectively, narrowing the performance gap. (3) Among visual retrievers, ColFlor offers parity with heavier options at a fraction of the footprint, making it the preferred choice when resources are limited but a strong generator is present. (4) The study provides an initial benchmark and methodological blueprint for building trustworthy multimodal RAG systems in specialized biomedical domains, highlighting that future advances in LLM multimodal reasoning will likely shift the balance toward native visual retrieval.

