EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding

Reading time: 4 minutes

📝 Original Info

  • Title: EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding
  • ArXiv ID: 2601.01547
  • Date: 2026-01-04
  • Authors: Tianjun Gu, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan

📝 Abstract

Figure 1. The concept of Teleo-Spatial Intelligence (TSI) in contrast to current paradigms. Current approaches are fundamentally object-centric. They are limited to Physical-Dynamic Reasoning (understanding how objects move and interact) but fail to grasp the underlying human purpose behind these changes. Our proposed TSI is a human-centric paradigm that unifies physical dynamics with Intent-Driven Reasoning. This synergy enables a holistic comprehension by inferring why spatial changes occur from how they happen.

💡 Deep Analysis

Figure 1

📄 Full Content

Tianjun Gu¹, Chenghua Gong¹, Jingyu Gong¹, Zhizhong Zhang¹, Yuan Xie¹˒³, Lizhuang Ma¹, Xin Tan¹˒²
¹East China Normal University, ²Shanghai AI Lab, ³Shanghai Innovation Institute

All data are released at https://huggingface.co/datasets/Gradygu3u/EscherVerse-Data.

Abstract

The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address these limitations, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning (understanding the physical principles of object interactions) and Intent-Driven Reasoning (inferring the human goals behind these actions). To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and a model family (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes.
Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description towards a holistic, purpose-driven understanding of the world.

“Flatten shapes drive me crazy... Do something, get out of the paper, show me your stuff!” – Escher

1. Introduction

The ability to perceive and reason about the spatial dynamics of the world is a key component of spatial intelligence [16]. However, current spatial intelligence falls short of enabling embodied agents to operate efficiently in human environments. A deeper form of spatial understanding is needed, which we conceptualise as Teleo-Spatial Intelligence (TSI). TSI transcends static scene description by integrating two key pillars of reasoning: Physical-Dynamic Reasoning, the ability to understand how objects move, interact, and change according to physical principles, and Intent-Driven Reasoning, the ability to infer the human goals and purposes behind these spatial changes.

arXiv:2601.01547v1 [cs.CV] 4 Jan 2026

Despite significant progress in spatial understanding, current research has three key limitations that hinder the exploration of TSI. First, existing benchmarks are confined to constrained environments. These include purely simulated worlds, such as Habitat [21, 22] and ProcTHOR [11], which offer control but lack real-world complexity, as well as datasets [4, 9, 12, 13, 18, 28] derived from static 3D scans of real indoor spaces, including ScanNet [10] and ARKitScenes [7]. While both of these approaches are valuable, they are limited to curated, often static, indoor settings, creating a significant domain gap and failing to capture the open-world complexity essential for both physical and intentional reasoning. Second, the focus of spatial intelligence research has largely remained on static scene comprehension. This overlooks the crucial element of dynamic reasoning, which is the very foundation of Physical-Dynamic Reasoning.
Third, and most importantly, existing approaches treat scenes as sterile, object-centric arrangements, completely neglecting Intent-Driven Reasoning. They can determine that a chair was moved, but not why or what that implies for future events, thus failing to bridge the gap towards genuine TSI.

To bridge these gaps and catalyze research towards a holistic evaluation of Teleo-Spatial Intelligence, we introduce EscherVerse, a new, large-scale, open-world benchmark (Escher-Bench) and dataset (Escher-35k) designed to propel spatial reasoning from static, simulated worlds to the dynamic, human-centric open world. Our work is founded on three core principles designed to address the pillars of TSI comprehensively: (1) From Simulation to Reality: establishing an open-world foundation for TSI. We move beyond constrained virtual indoor scenes by sourcing data from real-world videos. (2) From Static to Dynamic: focusing on the core of Physical-Dynamic Reasoning, EscherVerse explicitly targets dynamic events. Our benchmark req

Reference

This content is AI-processed based on open access ArXiv data.
