SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 40 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.
💡 Research Summary
The paper addresses a critical gap in the evaluation and improvement of spatial intelligence for multimodal large language models (MLLMs). Existing benchmarks are fragmented, focusing on superficial spatial queries and lacking comprehensive coverage of visual geometry, camera pose, motion, and 3‑D reasoning. To remedy this, the authors introduce SpatialScore, a new benchmark consisting of roughly 5,000 manually verified question‑answer pairs that span 30 distinct spatial tasks across ten intuitive categories (mental animation, counting, depth estimation, object distance, object motion, camera pose & motion, temporal reasoning, view reasoning, object size, and object localization). The dataset draws from repurposed 3‑D annotations (ScanNet++, Omni3D, WildRGB‑D, PointOdyssey, CA‑1M) and integrates samples from 23 public datasets, resulting in a balanced mix of judgment, multiple‑choice, and open‑ended formats. Distractors are generated through three strategies—random same‑category sampling, controlled numeric perturbations, and LLM‑generated confusing alternatives—making it harder for models to exploit shortcut cues in the answer options.
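The first two distractor strategies can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the perturbation factors in `rel_scales` are hypothetical choices, since the paper only states that controlled numeric perturbations and same‑category sampling are used.

```python
import random

def numeric_perturbation_distractors(answer: float, n: int = 3,
                                     rel_scales=(0.5, 0.8, 1.25, 1.5, 2.0)) -> list[float]:
    """Build distractors by scaling the ground-truth numeric answer.

    The scale factors are an illustrative assumption; the paper does not
    specify the exact perturbation magnitudes it uses.
    """
    scales = random.sample(rel_scales, n)
    return [round(answer * s, 2) for s in scales]

def same_category_distractors(answer: str, pool: list[str], n: int = 3) -> list[str]:
    """Sample distractors from other answers of the same semantic category."""
    candidates = [c for c in pool if c != answer]
    return random.sample(candidates, min(n, len(candidates)))

# Example: assemble a 4-way multiple-choice question for an object-size query.
options = same_category_distractors("chair", ["chair", "table", "sofa", "lamp", "bed"]) + ["chair"]
random.shuffle(options)
```

Perturbing the true numeric answer (rather than sampling arbitrary numbers) keeps distractors plausibly close to the correct value, so a model cannot succeed by eliminating obviously out-of-range options.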
Using SpatialScore, the authors evaluate 40 state‑of‑the‑art MLLMs, including Qwen‑VL, LLaVA, Gemini‑Pro, and others. The results reveal a substantial performance gap: models achieve modest accuracy on simple object‑presence tasks but struggle dramatically with dynamic reasoning, camera motion estimation, and 3‑D relational queries, lagging human performance by more than 20 percentage points on average.
To close this gap, the paper proposes two complementary pathways. First, SpatialCorpus, a large‑scale training resource containing 331,000 multimodal QA pairs, is constructed by augmenting the benchmark with additional synthetic and real‑world samples. The corpus covers both image and video modalities and maintains a balanced distribution across task types. Fine‑tuning Qwen3‑VL on SpatialCorpus yields consistent absolute accuracy gains of 8–15 percentage points across most categories, with notable improvements in camera pose, depth estimation, and object‑distance reasoning.
Second, the authors develop SpatialAgent, a tool‑orchestrated multi‑agent framework that equips pretrained MLLMs with 12 specialized spatial perception tools (e.g., depth estimator, camera pose estimator, object detector, motion tracker). SpatialAgent supports two reasoning paradigms: (i) Plan‑Execute, which decomposes complex queries into a hierarchy of sub‑tasks executed sequentially, and (ii) ReAct, an interleaved reasoning‑and‑action loop that iteratively refines answers using tool feedback. Importantly, SpatialAgent operates in a training‑free manner: by simply prompting the base model to invoke tools, it achieves an average absolute gain of 12 percentage points over the same models without tool access, and brings performance on dynamic questions (e.g., "Is the camera moving left or right?") close to human levels.
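The training-free ReAct paradigm described above can be sketched as a simple loop. Everything here is a minimal, assumed interface, not the paper's implementation: the two placeholder tools, the `call_llm` callable, and the `FINAL:` stop convention are all illustrative.

```python
# Minimal sketch of a training-free ReAct loop in the spirit of SpatialAgent.
# Tool names, the call_llm interface, and the FINAL: convention are
# illustrative assumptions; the paper's actual toolbox and prompts may differ.
from typing import Callable

def depth_estimator(image: str) -> str:
    return f"depth map computed for {image}"   # placeholder for a real tool

def object_detector(image: str) -> str:
    return f"objects detected in {image}"      # placeholder for a real tool

TOOLS: dict[str, Callable[[str], str]] = {
    "depth_estimator": depth_estimator,
    "object_detector": object_detector,
}

def react_loop(question: str, image: str, call_llm, max_steps: int = 5) -> str:
    """Interleave model reasoning and tool calls until a final answer is emitted."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)            # model proposes an action or an answer
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        tool_name = step.strip()
        observation = TOOLS.get(tool_name, lambda x: "unknown tool")(image)
        transcript += f"Action: {tool_name}\nObservation: {observation}\n"
    return "no answer within step budget"
```

Because the base model is only prompted (never fine-tuned), the same loop works with any instruction-following MLLM, which is what makes the paradigm training-free.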
The paper includes extensive ablation studies that dissect the contributions of data quality, tool selection, and reasoning strategy. It also provides detailed implementation specifications for the toolbox, annotation pipelines, and evaluation protocols, ensuring reproducibility. Limitations are acknowledged: the agent’s effectiveness depends on the accuracy and coverage of the underlying tools, and the current set of 12 tools does not span all possible domains (e.g., medical imaging, satellite data). Future work is suggested to explore automated tool selection, multi‑agent collaboration, and real‑time robotic applications.
In summary, this work delivers (1) the most comprehensive spatial intelligence benchmark to date, (2) a large‑scale, high‑quality training corpus, and (3) a versatile, training‑free agent framework. By releasing all data, code, and models, the authors provide a solid foundation for the community to advance MLLMs toward human‑level spatial reasoning.