Light Cones For Vision: Simple Causal Priors For Visual Hierarchy
Standard vision models treat objects as independent points in Euclidean space and cannot capture hierarchical structure such as parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime (worldlines): each object has multiple slots at different hierarchy levels that share the same spatial position but differ in temporal coordinate. This architecture consistently fails without the right geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479–0.661 across three datasets, a roughly 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings, indicating that visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires a geometry encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved here with only 11K parameters. The code is available at: https://github.com/iclrsubmissiongram/loco.
💡 Research Summary
The paper tackles a fundamental limitation of current object‑centric vision models: they treat objects as independent points in Euclidean space, which makes it impossible to represent hierarchical part‑whole relationships such as “a wheel is part of a car”. To overcome this, the authors propose Worldline Slot Attention (LoCo), a method that embeds objects as persistent trajectories—worldlines—in a (d + 1)‑dimensional Lorentzian spacetime. Each object is assigned a spatial center µ_i that is shared across multiple slots, while each slot occupies a distinct temporal coordinate t_j. In this way, an object’s representation spans several hierarchy levels (e.g., whole, part, sub‑part) along a vertical line in spacetime.
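The worldline parameterization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the array names (`mu`, `t`, `slots`) and the concrete values are assumptions chosen to mirror the text, where each object's slots share one spatial center but occupy distinct temporal coordinates.

```python
import numpy as np

# Hypothetical sketch of the worldline parameterization: each object i has one
# spatial center mu_i, shared by all of its slots, while each slot j occupies a
# distinct temporal coordinate t_j (one per hierarchy level).

rng = np.random.default_rng(0)
n_objects, n_levels, d = 3, 3, 2          # 3 objects, 3 hierarchy levels, 2-D space
mu = rng.normal(size=(n_objects, d))      # spatial centers mu_i (shared per object)
t = np.array([0.5, 1.0, 1.5])             # temporal coordinates t_j, one per level

# Build slots in (d+1)-dimensional spacetime: slot[i, j] = (t_j, mu_i),
# i.e. a vertical worldline per object.
slots = np.stack(
    [np.concatenate([[t[j]], mu[i]]) for i in range(n_objects) for j in range(n_levels)]
).reshape(n_objects, n_levels, d + 1)

# All slots of one object share the same spatial position ...
assert np.allclose(slots[0, :, 1:], mu[0])
# ... but occupy distinct temporal coordinates.
assert np.allclose(slots[0, :, 0], t)
```

With this layout, an object's representation literally spans the hierarchy levels along a vertical line in spacetime, as the summary describes.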
The Lorentzian metric ⟨x, y⟩_L = x⁰y⁰ − ∑_{k=1}^{d} xᵏyᵏ introduces a mixed signature (+, −, −, …) that naturally encodes asymmetry: the temporal dimension contributes positively, spatial dimensions negatively. This asymmetry creates light cones—future‑directed regions where information can flow. Slots with low temporal values have wide cones (they can attend to many features), while slots with high temporal values have narrow cones (they attend to few, more specific features). This mirrors the causal direction of visual hierarchies: abstract wholes influence many parts, but parts do not influence the whole.
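The inner product and the light-cone condition can be made concrete with a short sketch. This is an assumed implementation consistent with the signature above: a point y lies strictly inside the future cone of x when the temporal gap exceeds the spatial distance.

```python
import numpy as np

# Minimal sketch (assumed, not the paper's code) of the Lorentzian inner product
# <x, y>_L = x0*y0 - sum_k xk*yk with signature (+, -, -, ...), plus a
# future-light-cone membership test.

def lorentz_inner(x, y):
    """Lorentzian inner product of spacetime points (time-first convention)."""
    return x[0] * y[0] - np.dot(x[1:], y[1:])

def in_future_cone(x, y):
    """True if y lies strictly inside the future light cone of x."""
    tau = y[0] - x[0]                  # temporal gap
    r = np.linalg.norm(y[1:] - x[1:])  # spatial distance
    return tau > 0 and tau > r

x = np.array([0.0, 0.0, 0.0])
assert in_future_cone(x, np.array([2.0, 1.0, 0.0]))      # tau=2 > r=1: inside
assert not in_future_cone(x, np.array([0.5, 2.0, 0.0]))  # tau=0.5 < r=2: outside
```

Note how a slot placed at a low temporal coordinate has had more "time" to expand its cone by any fixed later slice, which is the geometric reason low-level slots can attend broadly while high-level slots attend narrowly.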
Attention is computed by combining the absolute Lorentzian distance |d_L(f, s)| between a feature f and a slot s with a cone‑membership score. The cone function depends on the temporal gap τ = f⁰ − s⁰ and the spatial distance r between the spatial components of f and s, and is modulated by an adaptive horizon h_j learned from local feature density. The final attention weight is
softmax( (−|d_L| + λ·tanh(cone)) / τ_temp ),
with λ and a temperature τ_temp as hyper‑parameters. Weighted feature aggregates are then passed through a GRU to update each slot.
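The attention rule above can be sketched as follows. The exact forms of |d_L| and the cone score are not fully specified in the summary, so this is a hedged illustration: the cone score `(tau - r) / h` and the scalar horizon `h` are assumptions standing in for the paper's adaptive per-slot horizon h_j, and the GRU update step is omitted.

```python
import numpy as np

# Hedged sketch of the attention computation: softmax over
# (-|d_L| + lam * tanh(cone)) / temp, with an assumed cone score (tau - r) / h.

def attention_weights(features, slots, lam=1.0, temp=1.0, h=1.0):
    """features: (F, d+1) and slots: (S, d+1) spacetime points (time-first).
    Returns an (F, S) row-stochastic attention matrix."""
    logits = np.zeros((len(features), len(slots)))
    for i, f in enumerate(features):
        for j, s in enumerate(slots):
            diff = f - s
            # squared Lorentzian interval with signature (+, -, -, ...)
            interval = diff[0] ** 2 - np.dot(diff[1:], diff[1:])
            d_abs = np.sqrt(abs(interval))   # |d_L(f, s)|
            tau = f[0] - s[0]                # temporal gap
            r = np.linalg.norm(diff[1:])     # spatial distance
            cone = (tau - r) / h             # > 0 inside the future cone (assumed form)
            logits[i, j] = (-d_abs + lam * np.tanh(cone)) / temp
    # numerically stable row-wise softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

In the full model, the weighted feature aggregates computed from this matrix would then drive the GRU slot update described in the text.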
The authors evaluate LoCo on three synthetic datasets that encode three‑level hierarchies via point clouds:
- Toy Hierarchical – 3 objects per scene, 3 levels (center, parts, sub‑parts) with 10 % noise.
- Sprites – similar structure using sprite bodies, limbs, joints.
- CLEVR – derived from CLEVR annotations, each object has a center, 3‑5 parts, and 8‑15 sub‑parts.
All experiments use the same lightweight architecture (≈ 11 K parameters), the same training schedule (300 epochs, 3 attention iterations), and identical hyper‑parameters across models. Four baselines are compared: (a) LoCo (Lorentzian worldlines with adaptive cones), (b) Hyperbolic WL (worldlines embedded in a Poincaré ball), (c) Euclidean WL (worldlines without any geometric structure), and (d) Euclidean Std (independent slots as in original Slot Attention).
Two metrics are reported: Object ARI (Adjusted Rand Index for clustering) and Level Accuracy (accuracy of assigning slots to the correct hierarchy level, using the fixed temporal mapping). Results are striking: Euclidean worldlines collapse to a level accuracy of 0.078 on all three datasets—well below random chance (0.33) and with zero variance—demonstrating that without a directional geometry the worldline binding constraint is meaningless. In contrast, Lorentzian worldlines achieve 0.48–0.66 level accuracy, a 6‑ to 8‑fold improvement, with statistical significance p < 0.0001 across 20+ independent runs. Hyperbolic worldlines perform in between (0.35–0.53), confirming that a symmetric tree‑like geometry is insufficient for visual part‑whole hierarchies. Object ARI follows the same trend: LoCo attains 0.45–0.51, hyperbolic 0.15–0.20, Euclidean WL 0.51–0.52 (but only because it ignores hierarchy), and Euclidean Std shows high variance.
The authors argue that the asymmetric causal structure encoded by Lorentzian light cones is the key inductive bias for hierarchical object discovery. Euclidean space lacks a notion of “past vs. future”, so slots at different temporal coordinates are indistinguishable, leading to catastrophic collapse. Hyperbolic space encodes hierarchy via radial distance from an origin, which models taxonomic trees but not the causal dependency where a part exists only because its whole exists.
Limitations are openly discussed: (i) the datasets assume a density‑based hierarchy (sparser points = higher‑level abstractions), which may not hold for natural semantics; (ii) the method fixes the hierarchy depth to three levels, whereas real scenes have variable depth; (iii) experiments are limited to 2‑D point clouds rather than raw pixels, leaving integration with CNNs or Vision Transformers for future work. The authors suggest extending to COCO‑Parts, PartImageNet, and dynamic depth learning as next steps.
In conclusion, the paper provides strong empirical evidence that geometry matters when architectural constraints impose a structure that Euclidean symmetry cannot satisfy. By co‑designing the model architecture (worldline binding) with a Lorentzian embedding space, the authors enable a lightweight (11 K parameters) system to discover part‑whole hierarchies that Euclidean or hyperbolic counterparts cannot. This work opens a new direction for object‑centric learning: leveraging differential‑geometric priors—specifically Lorentzian spacetime—to capture asymmetric causal relationships inherent in visual scenes.