EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial
Intelligence with Physical-Dynamic and Intent-Driven Understanding
Tianjun Gu1, Chenghua Gong1, Jingyu Gong1, Zhizhong Zhang1,
Yuan Xie1,3, Lizhuang Ma1, Xin Tan1,2
1East China Normal University 2Shanghai AI Lab 3Shanghai Innovation Institute
Figure 1. The concept of Teleo-Spatial Intelligence (TSI) in contrast to current paradigms. Current approaches are fundamentally object-
centric. They are limited to Physical-Dynamic Reasoning—understanding how objects move and interact—but fail to grasp the underlying
human purpose behind these changes. Our proposed TSI is a human-centric paradigm that unifies physical dynamics with Intent-Driven
Reasoning. This synergy enables a holistic comprehension by inferring why spatial changes occur from how they happen.
Abstract
The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address this limitation, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning (understanding the physical principles of object interactions) and Intent-Driven Reasoning (inferring the human goals behind these actions). To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description towards a holistic, purpose-driven understanding of the world.
All data released on https://huggingface.co/datasets/Gradygu3u/EscherVerse-Data.
“Flatten shapes drive me crazy... Do something, get out of the paper, show me your stuff!” – Escher
1. Introduction
The ability to perceive and reason about the spatial dynamics of the world is a key component of spatial intelligence [16]. However, current approaches to spatial intelligence struggle to enable embodied agents to operate efficiently in human environments. A deeper understanding of spatial intelligence is needed, which we conceptualize as Teleo-Spatial Intelligence (TSI). TSI transcends static scene description by integrating two key pillars of reasoning: Physical-Dynamic Reasoning, the ability to understand how objects move, interact, and change according to physical principles, and Intent-Driven Reasoning, the ability to infer the human goals and purposes behind these spatial changes.
arXiv:2601.01547v1 [cs.CV] 4 Jan 2026
Despite significant progress in spatial understanding, current research has three key limitations that hinder the exploration of TSI. First, existing benchmarks are confined to constrained environments. This includes purely simulated worlds, such as Habitat [21, 22] and ProcTHOR [11], which offer control but lack real-world complexity, as well as datasets [4, 9, 12, 13, 18, 28] derived from static 3D scans of real indoor spaces, including ScanNet [10] and ARKitScenes [7]. While both of these approaches are valuable, they are limited to curated, often static, indoor settings, creating a significant domain gap and failing to capture the open-world complexity essential for both physical and intentional reasoning. Second, research on spatial intelligence has largely remained focused on static scene comprehension. This overlooks the crucial element of dynamic reasoning, which is the very foundation of Physical-Dynamic Reasoning. Third, and most importantly, existing approaches treat scenes as sterile, object-centric arrangements, completely neglecting Intent-Driven Reasoning. They can determine that a chair was moved, but not why it was moved or what that implies for future events, thus failing to bridge the gap towards genuine TSI.
To bridge these gaps and catalyze research towards a holistic evaluation of Teleo-Spatial Intelligence, we introduce EscherVerse, a new, large-scale, open-world benchmark (Escher-Bench) and dataset (Escher-35k) designed to propel spatial reasoning from static, simulated worlds to the dynamic, human-centric open world. Our work is founded on three core principles designed to address the pillars of TSI comprehensively: (1) From Simulation to Reality: establishing an open-world foundation for TSI. We move beyond constrained virtual indoor scenes by sourcing data from real-world videos. (2) From Static to Dynamic: focusing on the core of Physical-Dynamic Reasoning, EscherVerse explicitly targets dynamic events.
Our benchmark
req