Human-level 3D shape perception emerges from multi-view learning
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
💡 Research Summary
The paper tackles the long‑standing challenge of reproducing human three‑dimensional (3D) shape perception using artificial neural networks. The authors propose a class of “multi‑view” vision transformers that are trained on large‑scale naturalistic scene data where each training sample consists of several images captured from different viewpoints of the same scene. The training objective is purely visual‑spatial: the network must predict for each image the associated camera pose, a dense depth map, and an aleatoric uncertainty (precision) for the depth prediction. Crucially, no object‑specific inductive biases or geometric priors are built into the architecture; the model learns 3D understanding solely from the statistical regularities of multi‑view correspondence, mirroring the cues humans obtain from stereopsis, motion parallax, and proprioception.
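The structure of one training sample described above can be made concrete with a small sketch. The field names and shapes below are illustrative assumptions, not the paper's actual data API: the point is simply that each sample bundles several views of one scene with the visual-spatial targets (camera pose and dense depth) the network must predict.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MultiViewSample:
    """Illustrative sketch of one multi-view training sample:
    several images of the same scene from different viewpoints,
    each paired with the visual-spatial prediction targets.
    Shapes and field names are assumptions for exposition."""
    images: np.ndarray        # (V, H, W, 3), V views of one scene
    camera_poses: np.ndarray  # (V, 4, 4), per-view camera transforms
    depth_maps: np.ndarray    # (V, H, W), per-pixel depth targets
```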
Four recent multi-view transformers are evaluated: DUST3R, MAST3R, Pi3, and the largest, VGGT-1B. VGGT-1B uses a standard ViT-style backbone (the frozen DINOv2-Large encoder) and adds a multi-view aggregation head that predicts pose, depth, and per-pixel precision. The loss combines an L1 depth error with a precision-weighted term, encouraging the network to assign high confidence to pixels that have consistent correspondences across views and low confidence to ambiguous regions.
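A minimal sketch of such a confidence-weighted depth loss is given below. The exact functional form and weighting used by VGGT may differ; this is one common formulation, where per-pixel L1 error is scaled by predicted precision and a log-precision regularizer prevents the network from declaring everything uncertain.

```python
import numpy as np


def precision_weighted_depth_loss(pred_depth, gt_depth, log_precision, alpha=0.5):
    """Sketch of a precision-weighted L1 depth loss (illustrative form).

    Confident pixels (high precision) are penalized more for depth error,
    while the -alpha * log_precision term discourages the trivial
    solution of assigning low confidence everywhere.
    """
    l1 = np.abs(pred_depth - gt_depth)
    precision = np.exp(log_precision)              # per-pixel confidence > 0
    loss = precision * l1 - alpha * log_precision  # error term + regularizer
    return float(loss.mean())
```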
To test whether these models exhibit human‑like 3D perception, the authors adopt a zero‑shot oddity‑judgment task previously used in cognitive psychology. Each trial presents two images of the same object from different viewpoints (A and A′) and a third image of a different object (B). Human participants (N≈300) must identify the non‑matching object. No model fine‑tuning or linear probing is performed; instead, the model’s internal uncertainty estimates are used directly. For each trial the model encodes all six pairwise image combinations, extracts the predicted precision for each pair, and selects the object whose pairs have the lowest average confidence as the oddity. Accuracy is then computed by comparing this prediction to the ground truth.
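The zero-shot readout described above can be sketched in a few lines. Assuming we have extracted a scalar correspondence confidence for each image pair (the helper name and input format below are illustrative), the oddity is the image whose pairs are, on average, least confident: two views of the same object should yield high-confidence correspondences, while pairs involving the mismatched object should not.

```python
import numpy as np


def pick_oddity(pair_confidence):
    """Sketch of the zero-shot oddity readout (names are illustrative).

    `pair_confidence` maps each unordered image pair (i, j) to the
    model's predicted correspondence confidence for that pair. Returns
    the image whose pairs have the lowest mean confidence.
    """
    images = sorted({i for pair in pair_confidence for i in pair})
    mean_conf = {
        img: np.mean([c for pair, c in pair_confidence.items() if img in pair])
        for img in images
    }
    return min(mean_conf, key=mean_conf.get)
```

On a trial where images 0 and 1 show the same object and image 2 the oddity, the matching pair gets high confidence and both mismatched pairs low confidence, so image 2 is selected.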
Results show that VGGT‑1B matches human average accuracy (≈92 %) while prior single‑image models such as DINOv2‑L perform substantially worse. Moreover, the margin between confidence for matching pairs (A‑A′) and non‑matching pairs (A‑B, A′‑B) correlates strongly with human performance across difficulty levels, indicating that the model’s uncertainty captures the same difficulty cues that drive human judgments. The authors further analyze the evolution of representations across the 24 transformer layers. By measuring cosine similarity of patch tokens for each image pair, they identify a “solution layer” where the network first reliably distinguishes the oddity; this layer typically lies in the middle‑to‑late stages of the model. At this point, similarity among matching images sharply increases while similarity involving the non‑matching image drops, mirroring the emergence of a categorical 3D representation.
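The layer-wise analysis above can be sketched as follows. Assuming mean-pooled patch-token vectors per image at each layer (the pooling choice and margin threshold are assumptions for illustration), a "solution layer" is the first layer where the matching pair's cosine similarity clearly exceeds both similarities involving the odd image.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solution_layer(layer_tokens, margin=0.1):
    """Sketch of the 'solution layer' analysis (threshold is illustrative).

    `layer_tokens` is a list over layers; each entry maps image index
    (0 and 1: matching views, 2: oddity) to a pooled patch-token vector.
    Returns the first layer where the matching pair's similarity exceeds
    both oddity similarities by `margin`, or None if no layer qualifies.
    """
    for layer, t in enumerate(layer_tokens):
        match_sim = cosine(t[0], t[1])
        odd_sim = max(cosine(t[0], t[2]), cosine(t[1], t[2]))
        if match_sim - odd_sim >= margin:
            return layer
    return None
```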
A striking finding is that the model’s confidence margin also predicts human reaction times: trials with smaller margins (i.e., more ambiguous) lead to longer human response times, suggesting that the model’s internal estimate of geometric correspondence aligns with human perceptual difficulty. Qualitative visualizations of cross‑image attention maps further illustrate that the network focuses on corresponding surface points when processing matching views, and disperses attention when presented with a mismatched object.
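The reaction-time analysis amounts to correlating a per-trial confidence margin with measured response times; the finding above predicts a negative correlation (smaller margin, longer response). A minimal sketch, with the function name chosen here for exposition:

```python
import numpy as np


def margin_rt_correlation(margins, reaction_times):
    """Sketch: Pearson correlation between per-trial confidence margins
    and human reaction times. The reported effect corresponds to a
    negative value (more ambiguous trials take longer)."""
    return float(np.corrcoef(margins, reaction_times)[0, 1])
```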
The study concludes that human‑level 3D shape perception can emerge from a simple, scalable learning objective that leverages multi‑view visual‑spatial statistics, without any object‑centric priors. This provides computational support for longstanding cognitive theories proposing that general‑purpose learning mechanisms, operating on natural sensory experience, are sufficient for developing sophisticated perceptual abilities. Limitations include reliance on massive multi‑view datasets, the static nature of the evaluation stimuli, and the indirect link between model uncertainty and neural correlates of human uncertainty. Future work is suggested to incorporate dynamic scenes, richer sensorimotor feedback, and direct neurophysiological comparisons to further bridge artificial and biological vision.