MuSLR: Multimodal Symbolic Logical Reasoning
Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities in current state-of-the-art vision language models (VLMs), we introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, spanning 35 atomic symbolic logic rules and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. We therefore propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1’s Chain-of-Thought performance by 14.13% and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
💡 Research Summary
The paper introduces MuSLR, the first benchmark specifically designed to evaluate multimodal symbolic logical reasoning, a capability that combines visual and textual information with formal logical inference. MuSLR contains 1,093 carefully curated instances drawn from seven real‑world domains (traffic, healthcare, finance, science, entertainment, sports, and general knowledge). Each instance includes an image, a textual context, a question (either a true/false/unknown judgment or a multiple‑choice selection), a set of formal logical rules (covering propositional logic, first‑order logic, and non‑monotonic logic), and a step‑by‑step ground‑truth reasoning chain. The logical depth of the tasks ranges from two to nine inference steps, and the dataset provides both atomic rules (35) and complex rule combinations (976).
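The instance fields described above can be captured in a small schema. The following sketch is illustrative only: the field names and value formats are assumptions inferred from the summary, not the official dataset layout.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical schema for one MuSLR instance; field names are
# illustrative, not the dataset's actual JSON keys.
@dataclass
class MuSLRInstance:
    image_path: str                 # visual context
    context: str                    # textual context
    question: str
    question_type: Literal["judgment", "multiple_choice"]
    logic_rules: list[str] = field(default_factory=list)      # propositional / FOL / non-monotonic rules
    reasoning_chain: list[str] = field(default_factory=list)  # step-by-step ground truth
    depth: int = 2                  # 2..9 inference steps
    answer: str = "unknown"         # "true" / "false" / "unknown" or an option label

# A toy traffic-domain example (invented for illustration).
ex = MuSLRInstance(
    image_path="traffic_001.jpg",
    context="The light ahead governs vehicle v.",
    question="Must vehicle v stop?",
    question_type="judgment",
    logic_rules=["RedLight(v) -> MustStop(v)"],
    reasoning_chain=["Image shows RedLight(v)", "Apply rule: MustStop(v)"],
    depth=2,
    answer="true",
)
```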
To assess current state‑of‑the‑art vision‑language models (VLMs), the authors evaluate seven leading systems—including GPT‑4.1, GPT‑4‑Vision, LLaVA‑13B, InstructBLIP, MiniGPT‑4, and others—under both zero‑shot and few‑shot chain‑of‑thought (CoT) settings. Results reveal that even the best model, GPT‑4.1, achieves only 46.8% accuracy on the truth‑evaluation task, with performance sharply declining as reasoning depth increases. Complex logical forms such as first‑order logic and non‑monotonic reasoning suffer even larger drops, indicating that existing VLMs lack robust mechanisms for integrating multimodal evidence with formal deduction.
To bridge this gap, the authors propose LogiCAM (Logical reasoning with Commonsense Augmentation in Multimodalities), a modular framework that decomposes the reasoning process into three stages: (1) Premise Selector, which extracts relevant premises from both image and text and aligns them; (2) Reasoner, which applies approximated formal logical rules directly within the VLM, avoiding lossy text‑only translations; and (3) Reasoning Type Identifier, which detects when symbolic information is insufficient and supplements it with commonsense inference. When combined with GPT‑4.1’s CoT prompting, LogiCAM raises overall accuracy by 14.13 percentage points, reaching roughly 60% on the full benchmark. Gains are especially pronounced for deeper logical chains and for first‑order logic tasks, where improvements exceed 18% and 22% respectively.
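The three-stage decomposition can be sketched as plain functions. This is a minimal toy model, not the paper's implementation: the "A -> B" rule format and every function name here are assumptions, and the real Reasoner runs inside the VLM rather than as symbolic forward chaining.

```python
# Toy sketch of LogiCAM's three stages over simple "antecedent -> consequent"
# string rules; all names and the rule format are illustrative assumptions.

def premise_selector(visual_facts, text_facts):
    """Stage 1: extract premises from both modalities and align them
    into a single set of grounded facts."""
    return set(visual_facts) | set(text_facts)

def reasoner(premises, rules, max_depth=9):
    """Stage 2: apply logical rules iteratively (here, naive forward
    chaining up to the benchmark's maximum depth of 9)."""
    derived = set(premises)
    for _ in range(max_depth):
        new = {c for r in rules
               for a, c in [r.split(" -> ")]
               if a in derived and c not in derived}
        if not new:
            break
        derived |= new
    return derived

def reasoning_type_identifier(premises, rules):
    """Stage 3: flag rules whose antecedent no premise supports,
    signalling that commonsense augmentation is needed."""
    return [r for r in rules if r.split(" -> ")[0] not in premises]

facts = premise_selector({"RedLight(v)"}, {"Vehicle(v)"})
rules = ["RedLight(v) -> MustStop(v)", "MustStop(v) -> BrakesApplied(v)"]
derived = reasoner(facts, rules)   # derives a depth-2 chain
gaps = reasoning_type_identifier(facts, ["Pedestrian(p) -> Yield(v)"])
```

The separation mirrors the paper's design choice: grounding (stage 1) and deduction (stage 2) are kept apart, so failures can be attributed to misaligned premises rather than faulty inference, and stage 3 decides when to fall back to commonsense.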
A detailed error analysis shows that about 70% of failures stem from “logical misalignment” between modalities—i.e., the visual facts extracted do not correctly correspond to the textual premises. Approximately 20% of errors are due to selecting the wrong logical rule, and the remaining 10% arise from mistakes in intermediate inference steps. This diagnosis highlights the need for better multimodal grounding and more reliable rule‑selection mechanisms.
In summary, the paper makes four major contributions: (1) defining the novel task of multimodal symbolic logical reasoning; (2) releasing the high‑quality MuSLR dataset with rich annotations and reasoning traces; (3) presenting LogiCAM, a strong baseline that demonstrates how modular decomposition and commonsense augmentation can substantially improve VLM performance on formal logic tasks; and (4) providing a thorough empirical study that pinpoints current limitations of VLMs, thereby charting a clear research agenda for future work on multimodal neuro‑symbolic AI.