ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation
Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation, built on three innovations. First, we construct the interpretation corpus using *Protocol-Guided Instruction Data Generation*, grounding interpretation in measurable ECG features and in monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with *Interleaved Modality Dropout* to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we introduce *Reinforcement Learning with ECG Diagnostic Evidence Rewards* to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not trust these outputs without independent verification. Code and data are publicly available [here](https://github.com/PKUDigitalHealth/ECG-R1), and an online platform can be accessed [here](http://ai.heartvoice.com.cn/ECG-R1/).
💡 Research Summary
ECG‑R1 is a novel multimodal large language model (MLLM) specifically engineered for reliable electrocardiogram (ECG) interpretation. The authors identify two critical shortcomings of existing medical MLLMs: (1) frequent hallucinations that produce plausible‑looking but clinically incorrect reports, and (2) a lack of robustness and cross‑modal consistency when either the ECG waveform image or the raw time‑series signal is unavailable. To address these issues, ECG‑R1 introduces three tightly coupled innovations.
First, Protocol‑Guided Instruction Data Generation replaces noisy, LLM‑prompted corpora with a deterministic, protocol‑driven pipeline. A non‑trainable feature extractor (FeatureDB) parses each 12‑lead ECG into 14 quantitative sequences (heart rate, RR intervals, P‑wave amplitude/duration, PR interval, QRS amplitude/duration, T‑wave characteristics, ST‑segment descriptors, QT/QTc). These features feed a "Protocol Guider" that formats a prompt reflecting the five‑phase interpretation workflow defined in the textbook *ECG from Basics to Essentials*: Technical‑Rate‑Rhythm, Conduction‑Axis‑Intervals, Chamber Hypertrophy‑Voltage, Ischemia‑Infarction‑Mimics, and Electrolytes‑QT. Using DeepSeek‑V3.1‑Terminus as a generator, the system produces 30,000 instruction–response pairs, each grounded in the measured features and the five‑phase protocol.
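The feature-to-prompt step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `build_protocol_prompt`, the feature keys, and the exact prompt wording are all assumptions; only the five phase labels come from the summary above.

```python
# Illustrative sketch of protocol-guided prompt construction.
# build_protocol_prompt and the feature keys are hypothetical names.

PHASES = [
    "Technical-Rate-Rhythm",
    "Conduction-Axis-Intervals",
    "Chamber Hypertrophy-Voltage",
    "Ischemia-Infarction-Mimics",
    "Electrolytes-QT",
]

def build_protocol_prompt(features: dict) -> str:
    """Format measured ECG features into a five-phase interpretation prompt."""
    lines = ["Interpret this 12-lead ECG using the following measurements:"]
    for name, value in sorted(features.items()):
        lines.append(f"- {name}: {value}")
    lines.append("Work through the phases in order:")
    for i, phase in enumerate(PHASES, 1):
        lines.append(f"{i}. {phase}")
    return "\n".join(lines)

prompt = build_protocol_prompt({"heart_rate_bpm": 72, "qtc_ms": 430})
```

The key property is determinism: given the same FeatureDB measurements, the prompt is always identical, so variability in the corpus comes only from the generator model, not from the grounding step.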
Second, a Modality‑Decoupled Architecture with Interleaved Modality Dropout (IMD) separates the image and signal processing streams. The model employs Qwen3‑VL‑8B for visual encoding and ECG‑CoCa for time‑series encoding, each followed by its own linear projector into the shared LLM embedding space. During training, IMD randomly withholds one modality so that either stream alone can support interpretation, improving robustness and cross‑modal consistency when the ECG image or signal is missing.
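A minimal sketch of the dropout schedule, assuming per-example sampling over three presentation modes (both modalities, image only, signal only); the probabilities and function name are illustrative, not taken from the paper.

```python
import random

# Hypothetical sketch of Interleaved Modality Dropout: each training example
# keeps both modalities, only the ECG image, or only the raw signal, so each
# encoder/projector stream learns to support interpretation on its own.
# The sampling probabilities below are illustrative assumptions.

def sample_modality_mask(p_both=0.5, p_image_only=0.25, rng=random):
    """Return (use_image, use_signal) for one training example."""
    r = rng.random()
    if r < p_both:
        return True, True        # both modalities present
    if r < p_both + p_image_only:
        return True, False       # ECG image only
    return False, True           # ECG signal only
```

Note that at least one modality is always kept, so every example still provides a usable input to the LLM.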
Third, ECG Diagnostic Evidence Rewards (EDER) extend standard RL from answer‑level correctness to evidence‑level reasoning. The reward function comprises (i) a terminal diagnosis reward for matching the ground‑truth label and (ii) intermediate rewards for evidence‑grounded reasoning steps, encouraging the model to support each conclusion with measurable ECG findings rather than only the final answer.
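One way such a reward could be composed is sketched below. This is a hedged illustration in the spirit of EDER, not the paper's actual formulation: the weights, the set-overlap matching rule, and the function name are all assumptions.

```python
# Illustrative sketch of an evidence-level reward: a terminal reward for the
# final diagnosis plus partial credit for intermediate evidence statements
# that match ground-truth findings. Weights and matching are assumptions.

def eder_reward(pred_diagnosis, true_diagnosis,
                pred_evidence, true_evidence,
                w_terminal=1.0, w_evidence=0.5):
    """Combine diagnosis correctness with evidence-step overlap."""
    terminal = w_terminal if pred_diagnosis == true_diagnosis else 0.0
    if true_evidence:
        overlap = len(set(pred_evidence) & set(true_evidence)) / len(true_evidence)
    else:
        overlap = 0.0
    return terminal + w_evidence * overlap
```

Under this shaping, a correct diagnosis supported by the right findings scores strictly higher than a correct diagnosis with fabricated or missing evidence, which is the behavior EDER is designed to reward.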
The authors conduct a comprehensive benchmark across seven proprietary/open‑source MLLMs (including GPT‑5.1, MedGemma, and GEM) and three earlier ECG‑specific models. Using the MIMIC‑IV‑ECG cohort and a panel of 30 board‑certified cardiologists, they evaluate (a) diagnostic accuracy, (b) hallucination rate (the percentage of generated statements that are clinically false), and (c) cross‑modal consistency (agreement between image‑only and signal‑only outputs, measured by F1). ECG‑R1 achieves 92.4% diagnostic accuracy (the highest among all competitors, a gain of 7.3 percentage points over the best baseline), a hallucination rate of only 3.2% (versus an average of 27% for the other models), and a consistency score of 0.94 (baseline ≈0.71). Cardiologists rated ECG‑R1's reports as "clinically useful" with an average score of 4.6/5, indicating strong real‑world applicability.
Limitations are acknowledged: the protocol is tied to a single textbook, which may not capture regional guideline variations; the deterministic FeatureDB may be sensitive to noisy or artifact‑laden recordings; and the RL reward design adds complexity that could hinder reproducibility. Future work is proposed to incorporate multi‑guideline protocol ensembles, develop noise‑robust feature extractors, and explore automated reward shaping.
In summary, ECG‑R1 demonstrates that a pipeline combining (1) protocol‑grounded data generation, (2) modality‑decoupled encoding with interleaved modality dropout, and (3) evidence‑centric reinforcement learning can produce a multimodal LLM that is both clinically trustworthy and robust to missing modalities, addressing the two major failure modes that have limited prior medical MLLMs in ECG interpretation.