Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support
In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while noting that further work is needed to make the system more complete.
💡 Research Summary
This paper introduces Oph‑Guid‑RAG, a multimodal visual retrieval‑augmented generation (RAG) system designed specifically for ophthalmology clinical question answering and decision support. The core innovation lies in treating each guideline page as an independent evidence unit and retrieving the raw page images rather than extracting text via OCR. By preserving tables, flowcharts, dosage thresholds, and other layout‑dependent information, the system avoids the information loss and noise that plague traditional text‑centric RAG pipelines.
The architecture consists of four stages. In the offline corpus preparation stage, 305 ophthalmology guidelines (totaling 7,001 pages) are converted from PDF to high‑resolution images (5390 × 7940 px), stored in an object storage service (TOS), and indexed with FAISS after encoding with the multimodal retriever ColQwen2.5. No OCR or structural parsing is performed, ensuring that visual cues remain intact.
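The offline stage described above amounts to "embed every page image, then index the embeddings for similarity search." The minimal sketch below illustrates that shape only: random vectors and brute-force inner-product search stand in for the ColQwen2.5 encoder and the FAISS index used in the paper, and all names are illustrative.

```python
import numpy as np

def build_index(page_embeddings: np.ndarray) -> np.ndarray:
    """Stand-in for the FAISS index: L2-normalize page embeddings so that
    inner product equals cosine similarity (as in a FAISS IndexFlatIP
    over normalized vectors)."""
    norms = np.linalg.norm(page_embeddings, axis=1, keepdims=True)
    return page_embeddings / norms

def search(index: np.ndarray, query_embedding: np.ndarray, k: int = 5):
    """Return the top-k page indices and their similarity scores.
    In the real system the query is encoded by ColQwen2.5 and the
    search is delegated to FAISS; here it is exact brute force."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Tiny demo corpus: 4 "pages" with trivially distinguishable embeddings.
index = build_index(np.eye(4) * 2.0)
top, scores = search(index, np.array([0.0, 1.0, 0.0, 0.0]), k=2)
```

At the paper's scale (7,001 pages), the same pattern applies, but an approximate or GPU-backed FAISS index would replace the brute-force matrix product.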
During query processing, a Planner optionally decomposes a complex user query into up to three focused sub‑questions (SQ1‑SQ3). A Router then decides for each sub‑question whether to follow the RAG path (evidence‑grounded) or the DIRECT path (pure generation). Sub‑questions routed to RAG are passed through a Query Rewrite module that reformulates them into retrieval‑friendly phrasing.
The retrieval‑and‑filter stage encodes the rewritten queries with ColQwen2.5, searches the FAISS index for the top‑k candidate page images, and applies a relevance scorer based on GPT‑5.2 to evaluate each candidate against both the original question and the rewritten sub‑question. Low‑relevance pages are filtered out; if insufficient evidence remains, the system falls back to the DIRECT path.
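The filter-and-fallback logic of this stage reduces to: keep candidate pages whose relevance score clears a threshold, and fall back to the DIRECT path when too few survive. A minimal sketch, assuming the GPT-based scorer has already produced a score per page (the threshold value and tuple layout are illustrative, not from the paper):

```python
from typing import List, Tuple

def filter_candidates(
    scored_pages: List[Tuple[str, float]],  # (page_id, relevance score)
    threshold: float = 0.5,
    min_pages: int = 1,
) -> Tuple[List[Tuple[str, float]], str]:
    """Drop low-relevance pages; if insufficient evidence remains,
    fall back to pure generation (DIRECT path)."""
    kept = [(pid, s) for pid, s in scored_pages if s >= threshold]
    if len(kept) < min_pages:
        return [], "DIRECT"
    return kept, "RAG"
```

Scoring each candidate against both the original question and the rewritten sub-question (as the paper does) guards against rewrite drift: a page that matches only the rewritten phrasing but not the user's actual question gets filtered out.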
In the generation stage, evidence‑grounded multimodal reasoning is performed by a vision‑language model that jointly consumes the textual query and the selected page images, while the DIRECT branch uses a text‑only large language model (LLM). A final synthesis module aggregates the sub‑answers into a single response, appends the URLs of any guideline pages used, and records a full process trace for auditability.
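The final synthesis step, stripped of the model calls, is a merge of sub-answers plus an appended list of the guideline-page URLs that were actually used. A minimal sketch (output format and example URL are hypothetical, not the paper's):

```python
from typing import List

def synthesize(sub_answers: List[str], evidence_urls: List[str]) -> str:
    """Aggregate per-sub-question answers into one response and
    append traceable guideline page references, if any were used."""
    body = "\n\n".join(f"- {answer}" for answer in sub_answers)
    if not evidence_urls:
        return body  # all sub-questions took the DIRECT path
    refs = "\n".join(f"[{i + 1}] {url}" for i, url in enumerate(evidence_urls))
    return f"{body}\n\nGuideline pages:\n{refs}"
```

Keeping the reference list tied to the pages the filter actually retained is what makes the output auditable: every claim can be traced back to a specific guideline page image.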
Evaluation is conducted on HealthBench, a doctor‑based scoring benchmark that assesses accuracy, completeness, context awareness, and communication quality. On the hard subset—representing the most challenging clinical scenarios—Oph‑Guid‑RAG improves the overall HealthBench score from 0.2969 to 0.3861 (+30 %) and raises accuracy from 0.5956 to 0.6576 (+10.4 %). Compared with GPT‑5.4, the accuracy gain is +24.4 %. Ablation studies demonstrate that removing the routing controller, the reranking/filtering step, or the multimodal retrieval leads to substantial performance drops, confirming that controllable retrieval is essential for stable, reliable outputs.
The authors acknowledge several limitations. Image‑based retrieval incurs higher memory and compute costs than text‑based methods, and page‑level evidence can be overly granular, requiring additional post‑processing to align with clinical workflow. The current system does not yet incorporate non‑guideline evidence such as recent research articles, and real‑time guideline updates are not integrated. Future work will explore scaling the evidence pool, optimizing the visual encoder for efficiency, and building pipelines for continuous guideline ingestion.
In summary, Oph‑Guid‑RAG demonstrates that preserving the visual fidelity of clinical guidelines and coupling it with a controllable retrieval router markedly enhances evidence grounding, traceability, and safety in ophthalmic AI assistants. The study provides a concrete blueprint for building robust, multimodal RAG systems that meet the stringent demands of real‑world medical decision support.