Meissa: Multi-modal Medical Agentic Intelligence
Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability to offline deployment. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. With over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency than API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
💡 Research Summary
The paper addresses a critical gap in current medical AI systems: while multimodal large language models (MM‑LLMs) have achieved impressive performance on image understanding and clinical reasoning, the most capable medical agents rely on proprietary frontier models (e.g., GPT, Gemini, Claude) accessed via cloud APIs. This deployment paradigm incurs high monetary cost, introduces latency that can disrupt clinical workflows, and raises privacy concerns because patient data must leave the premises. To overcome these limitations, the authors introduce Meissa, a 4‑billion‑parameter multimodal medical model that can operate fully offline yet retain agentic capabilities such as tool use, multi‑turn interaction, and role‑based collaboration.
Meissa’s novelty lies in three methodological contributions. First, the authors define a unified trajectory formalism based on a state‑action‑observation (SAO) tuple. Each interaction step is represented as (sₜ, aₜ, oₜ₊₁), where the state contains the conversation history and any embedded images, the action is either a JSON‑encoded tool call or a final answer token, and the observation is the tool’s structured response (text, bounding boxes, new images, etc.). This SAO representation abstracts away environment‑specific APIs, allowing a single model to learn across heterogeneous medical settings: continuous tool chains, interleaved image‑text reasoning, multi‑agent debates, and clinical simulations.
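To make the SAO formalism concrete, here is a minimal sketch of how one (sₜ, aₜ, oₜ₊₁) step and a full trajectory might be represented as a JSON training record. The field names (`state`, `action`, `observation`), the `lesion_detector` tool, and the example chest X-ray content are illustrative assumptions, not the paper's actual schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SAOStep:
    """One (s_t, a_t, o_{t+1}) step in a unified trajectory (hypothetical schema)."""
    state: dict        # conversation history plus references to embedded images
    action: dict       # either a JSON-encoded tool call or a final answer
    observation: dict  # the tool's structured response (text, boxes, images, ...)

def serialize_trajectory(steps):
    """Flatten a list of SAO steps into one JSON record for supervised training."""
    return json.dumps([asdict(s) for s in steps])

# Example: a tool-call step followed by a final-answer step (illustrative content)
steps = [
    SAOStep(
        state={"history": ["What abnormality is visible in this chest X-ray?"],
               "images": ["cxr_001.png"]},
        action={"tool": "lesion_detector", "args": {"image": "cxr_001.png"}},
        observation={"boxes": [[34, 50, 120, 140]], "label": "opacity"},
    ),
    SAOStep(
        state={"history": ["<detector output appended to context>"],
               "images": ["cxr_001.png"]},
        action={"final_answer": "Right lower lobe opacity, consistent with pneumonia."},
        observation={},
    ),
]
record = serialize_trajectory(steps)
```

Because every environment (tool chains, interleaved reasoning, debates, simulations) is flattened into the same step schema, a single model can be trained on the concatenation of all of them without environment-specific input formats.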
Second, the training data are generated through a three‑tier stratified supervision pipeline that uses the student model’s own errors as a curriculum signal. Tier 1 collects direct‑reasoning trajectories for samples the student already solves; Tier 2 gathers enhanced‑reasoning trajectories for samples the student fails but a stronger teacher (Gemini‑3‑flash) can solve without external tools; Tier 3 creates full agentic trajectories for the hardest residual cases, invoking four distinct agent environments. This progressive escalation teaches the model an implicit difficulty‑aware routing policy: it learns when a simple parametric answer suffices and when deeper interaction (tool calls, multi‑step reasoning, or expert collaboration) is required.
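The tier-assignment logic described above can be sketched as a simple routing function. The predicates `student` and `teacher` (does each model answer the sample correctly without tools?) and the function names are assumptions for illustration; the paper's pipeline additionally generates the corresponding trajectories per tier:

```python
def assign_tier(student_correct: bool, teacher_correct: bool) -> int:
    """Route a training sample to a supervision tier using the student's errors.

    Tier 1: the student already solves it  -> direct-reasoning trace.
    Tier 2: the student fails but a stronger teacher solves it
            without external tools        -> enhanced-reasoning trace.
    Tier 3: both fail parametrically      -> full agentic trajectory.
    """
    if student_correct:
        return 1
    if teacher_correct:
        return 2
    return 3

def stratify(samples, student, teacher):
    """Partition samples into the three tiers using correctness predicates."""
    tiers = {1: [], 2: [], 3: []}
    for s in samples:
        tiers[assign_tier(student(s), teacher(s))].append(s)
    return tiers

# Toy usage with stand-in correctness predicates
samples = ["easy_case", "medium_case", "hard_case"]
student = lambda s: s == "easy_case"
teacher = lambda s: s != "hard_case"
tiers = stratify(samples, student, teacher)
```

The escalation order matters: because harder samples are only ever paired with more interactive supervision, the model sees a consistent mapping from difficulty to strategy, which is what lets it learn the implicit routing policy.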
Third, the authors introduce prospective‑retrospective supervision. For each agentic sample, a prospective trace records the model’s exploratory actions under real observations, while a retrospective trace preserves the same action sequence but attaches hindsight rationalizations and cleaned explanations. Training on both traces stabilizes policy learning, mitigates the noise inherent in raw behavior cloning, and encourages the model to generate both effective actions and coherent justifications.
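A minimal sketch of how such paired traces might be constructed: the prospective trace is kept verbatim, while a retrospective copy preserves the action sequence but replaces each rationale with a hindsight-informed rewrite. The step layout (action, observation, rationale triples) and the `rationalize` helper are hypothetical; in the actual pipeline the rewriting would be done by a teacher model rather than string formatting:

```python
def rationalize(step, outcome):
    """Stand-in for a teacher model rewriting a rationale with hindsight.

    Keeps the action and observation unchanged; only the explanation
    is rewritten in light of the known final outcome.
    """
    action, observation, raw_rationale = step
    return (action, observation, f"[hindsight] {raw_rationale} (outcome: {outcome})")

def build_pair(prospective_trace, outcome):
    """Pair an exploratory forward trace with its hindsight-rationalized twin.

    Both traces share the exact same action sequence, so the policy target
    is identical; only the supervision on the reasoning text differs.
    """
    retrospective_trace = [rationalize(step, outcome) for step in prospective_trace]
    return {"prospective": prospective_trace, "retrospective": retrospective_trace}

# Toy usage: one exploratory step, then pairing
trace = [({"tool": "segmenter"}, {"mask": "mask_001.png"}, "try segmentation first")]
pair = build_pair(trace, "correct diagnosis")
```

Training on both copies means the cloned actions come with two views of the reasoning, one noisy-but-honest and one cleaned, which is what stabilizes behavior cloning on exploratory data.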
The resulting dataset comprises roughly 40 K curated trajectories (≈8.2 K direct, 9.8 K enhanced, 23.9 K agentic). Meissa is initialized from Qwen‑VL‑4B and fine‑tuned on this data for about 12 hours on eight A6000 GPUs. Despite having 25 × fewer parameters than frontier models like Gemini‑3, Meissa matches or exceeds them on 10 of 16 evaluation settings across 13 benchmarks covering radiology, pathology, and clinical reasoning. It achieves near‑oracle strategy‑selection accuracy and reduces end‑to‑end latency by ~22 × compared with API‑based deployment.
The paper’s contributions are significant for real‑world healthcare AI. By distilling complex agentic behavior into a compact, offline‑capable model, Meissa reconciles the need for sophisticated tool‑augmented reasoning with strict data‑privacy and latency constraints of hospital environments. The unified SAO trajectory framework and stratified curriculum could be applied to other domains where multi‑modal, multi‑step interaction is essential. Future work may explore expanding the tool repertoire, integrating with electronic health record systems, and continual learning from live clinical interactions. Overall, Meissa demonstrates that high‑quality, agentic medical AI does not require massive proprietary models, opening the door to more accessible, cost‑effective, and secure AI assistants for clinicians.