ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using *Protocol-Guided Instruction Data Generation*, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with *Interleaved Modality Dropout* to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present *Reinforcement Learning with ECG Diagnostic Evidence Rewards* to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at https://github.com/PKUDigitalHealth/ECG-R1, and an online platform can be accessed at http://ai.heartvoice.com.cn/ECG-R1/.


💡 Research Summary

ECG‑R1 is a novel multimodal large language model (MLLM) specifically engineered for reliable electrocardiogram (ECG) interpretation. The authors identify two critical shortcomings of existing medical MLLMs: (1) frequent hallucinations that produce plausible‑looking but clinically incorrect reports, and (2) a lack of robustness and cross‑modal consistency when either the ECG waveform image or the raw time‑series signal is unavailable. To address these issues, ECG‑R1 introduces three tightly coupled innovations.

First, Protocol‑Guided Instruction Data Generation replaces noisy, LLM‑prompted corpora with a deterministic, protocol‑driven pipeline. A non‑trainable feature extractor (FeatureDB) parses each 12‑lead ECG into 14 quantitative sequences (heart rate, RR intervals, P‑wave amplitude/duration, PR interval, QRS amplitude/duration, T‑wave characteristics, ST segment descriptors, QT/QTc). These features are fed into a “Protocol Guider” that formats a prompt reflecting the five‑phase interpretation workflow defined in the textbook *ECG from Basics to Essentials* (Technical‑Rate‑Rhythm, Conduction‑Axis‑Intervals, Chamber Hypertrophy‑Voltage, Ischemia‑Infarction‑Mimics, Electrolytes‑QT). Using DeepSeek‑V3.1‑Terminus as a generator, the system produces 30,000 instruction–response pairs, each containing a structured six‑step reasoning chain, a concise narrative summary, and a final diagnosis. By grounding every token in measured ECG features and explicit threshold rules, the corpus dramatically reduces hallucination risk.
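The prompt-construction step of this pipeline can be pictured with a short sketch. All names here (`build_protocol_prompt`, `PHASES`, the feature keys) are illustrative assumptions for exposition, not the authors' actual API:

```python
# Hypothetical sketch of the Protocol Guider step: measured ECG features are
# formatted into a prompt that walks the generator through the five phases.
PHASES = [
    "Technical-Rate-Rhythm",
    "Conduction-Axis-Intervals",
    "Chamber Hypertrophy-Voltage",
    "Ischemia-Infarction-Mimics",
    "Electrolytes-QT",
]

def build_protocol_prompt(features: dict) -> str:
    """Format measured ECG features into a five-phase interpretation prompt."""
    lines = ["Interpret this 12-lead ECG strictly from the measured features below."]
    lines.append("Measured features:")
    for name, value in features.items():
        lines.append(f"  - {name}: {value}")
    lines.append("Follow these interpretation phases in order:")
    for i, phase in enumerate(PHASES, 1):
        lines.append(f"  {i}. {phase}")
    return "\n".join(lines)

# Example usage with illustrative feature values.
prompt = build_protocol_prompt({"heart_rate_bpm": 48, "rr_interval_ms": 1250, "qtc_ms": 430})
```

Because the generator only ever sees features that were actually measured, every threshold claim in the resulting instruction pair is traceable back to FeatureDB output.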

Second, Modality‑Decoupled Architecture with Interleaved Modality Dropout (IMD) separates the image and signal processing streams. The model employs Qwen3‑VL‑8B for visual encoding and ECG‑CoCa for time‑series encoding, each followed by its own linear projector into the shared LLM embedding space. A dedicated tag marks where the signal tokens are inserted, keeping them independent from the image token block. During both supervised fine‑tuning (SFT) and reinforcement learning (RL), IMD randomly (i) drops either modality with probability p_d and (ii) swaps the order of the two token blocks with probability p_s. The authors formalize a finite set of test‑time environments T_test (image‑only, signal‑only, normal order, swapped order) and prove two theorems: (a) minimizing the mixture risk R_q upper‑bounds the worst‑case risk R_max (robustness), and (b) the excess risk controls the total‑variation distance between modality‑specific output distributions (cross‑modal consistency). In practice, with p_d = 0.3 and p_s = 0.2, training under the mixture risk keeps the model accurate whether it receives an image, a signal, both, or a swapped token sequence.
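The IMD augmentation itself is a small operation on the two token blocks. A minimal sketch, assuming token blocks are plain lists and that at most one modality is ever dropped per example (the function name and signature are hypothetical):

```python
import random

def interleaved_modality_dropout(img_tokens, sig_tokens, p_d=0.3, p_s=0.2, rng=random):
    """Sketch of IMD: drop one modality with prob p_d, swap block order with prob p_s."""
    blocks = [list(img_tokens), list(sig_tokens)]
    if rng.random() < p_d and img_tokens and sig_tokens:
        # Drop exactly one modality, chosen uniformly; never drop both.
        blocks[rng.randrange(2)] = []
    if rng.random() < p_s:
        # Swap the order of the image and signal token blocks.
        blocks.reverse()
    return blocks[0] + blocks[1]
```

Exposing the model to all four resulting environments (image-only, signal-only, normal order, swapped order) during training is what makes the mixture-risk bound on the worst-case environment meaningful at test time.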

Third, ECG Diagnostic Evidence Rewards (EDER) extends standard RL from answer‑level correctness to evidence‑level reasoning. The reward function comprises (i) a terminal diagnosis reward (matching the ground‑truth label) and (ii) intermediate rewards for each step of the chain that correctly cites quantitative evidence (e.g., “RR interval > 1000 ms, i.e., heart rate < 60 bpm → bradycardia”). This dual‑level feedback forces the model to internalize the clinical reasoning process rather than merely memorizing final diagnoses. Empirically, after RL the model’s negative log‑likelihood drops by 12 % and the proportion of correctly cited evidence rises from 58 % to 80 % on a held‑out validation set.
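The dual-level structure of such a reward can be sketched as a weighted sum of a terminal term and an evidence-coverage term. The weights `w_final` and `w_step` and the set-matching of evidence strings are assumptions for illustration; the paper's actual reward shaping may differ:

```python
def eder_reward(pred_diagnosis, pred_evidence, gold_diagnosis, gold_evidence,
                w_final=1.0, w_step=0.5):
    """Sketch of an evidence-grounded reward: terminal correctness plus
    partial credit for each correctly cited piece of quantitative evidence."""
    # Terminal diagnosis reward: full credit only for the ground-truth label.
    final_r = w_final if pred_diagnosis == gold_diagnosis else 0.0
    # Intermediate evidence reward: fraction of gold evidence correctly cited.
    if gold_evidence:
        hits = sum(1 for e in pred_evidence if e in gold_evidence)
        step_r = w_step * hits / len(gold_evidence)
    else:
        step_r = 0.0
    return final_r + step_r
```

A trajectory that names the right diagnosis but cites no supporting measurements scores strictly lower than one that grounds each step, which is the behavior the evidence-level term is meant to enforce.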

The authors conduct a comprehensive benchmark across seven proprietary/open‑source MLLMs (including GPT‑5.1, MedGemma, and GEM) and three earlier ECG‑specific models. Using the MIMIC‑IV‑ECG cohort and a panel of 30 board‑certified cardiologists, they evaluate (a) diagnostic accuracy, (b) hallucination rate (percentage of generated statements that are clinically false), and (c) cross‑modal consistency (agreement between image‑only and signal‑only outputs measured by F1). ECG‑R1 achieves 92.4 % diagnostic accuracy (the highest among all competitors, a +7.3 % absolute gain over the best baseline), a hallucination rate of only 3.2 % (versus an average of 27 % for other models), and a consistency score of 0.94 (baseline ≈0.71). Cardiologists rated ECG‑R1’s reports as “clinically useful” with an average score of 4.6/5, indicating strong real‑world applicability.
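The cross-modal consistency metric above compares what the model reports when given only the image against only the signal. One plausible reading, sketched under the assumption that each run yields a set of diagnosis labels and that F1 is computed over the label overlap (the function is illustrative, not the paper's exact protocol):

```python
def label_f1(labels_a: set, labels_b: set) -> float:
    """F1 overlap between diagnosis sets from two modality-specific runs,
    e.g. image-only vs. signal-only outputs for the same ECG."""
    if not labels_a and not labels_b:
        return 1.0  # both runs report nothing: perfect agreement
    tp = len(labels_a & labels_b)
    if tp == 0:
        return 0.0
    precision = tp / len(labels_b)  # treat run B as predictions against run A
    recall = tp / len(labels_a)
    return 2 * precision * recall / (precision + recall)
```

Averaging this score over a cohort gives a single consistency number of the kind reported (0.94 for ECG-R1 versus roughly 0.71 for baselines).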

Limitations are acknowledged: the protocol is tied to a single textbook, which may not capture regional guideline variations; the deterministic FeatureDB may be sensitive to noisy or artifact‑laden recordings; and the RL reward design adds complexity that could hinder reproducibility. Future work is proposed to incorporate multi‑guideline protocol ensembles, develop noise‑robust feature extractors, and explore automated reward shaping.

In summary, ECG‑R1 demonstrates that a pipeline combining (1) protocol‑grounded data generation, (2) modality‑decoupled encoding with theoretically justified interleaved dropout, and (3) evidence‑centric reinforcement learning can produce a multimodal LLM that is both clinically trustworthy and robust to missing modalities—addressing the two major failure modes that have limited prior medical MLLMs in ECG interpretation.

