Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or on learning-based approaches that lack semantic awareness, and thus fail to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering alongside a point cloud transformer for geometric processing. Cross-attention ensures that geometric features query semantic content, so bottom-up distinctiveness guides top-down retrieval. We further extend the framework to temporal scanpath generation through reinforcement learning, introducing the first formulation that respects 3D mesh topology and models inhibition-of-return dynamics. Evaluation on the SAL3D, NUS3D, and 3DVA datasets demonstrates substantial improvements, showing that cognitively motivated architectures can effectively model human visual attention on three-dimensional surfaces.


💡 Research Summary

The paper tackles the long‑standing problem of predicting human visual attention on three‑dimensional (3D) objects by explicitly modeling the interaction between bottom‑up geometric saliency and top‑down semantic relevance. Existing 3D saliency approaches either rely on handcrafted curvature‑based descriptors or on deep networks that process point clouds without any semantic awareness, which leaves them unable to explain why people fixate on flat but semantically meaningful regions such as faces, text, or functional handles.

To address this gap the authors introduce SemGeo‑AttentionNet, a dual‑stream architecture that treats geometry and semantics as separate modalities and fuses them through an asymmetric cross‑modal attention mechanism. The semantic stream is built from frozen diffusion‑based priors: the input mesh is rendered from 100 uniformly sampled viewpoints, each view is conditioned on depth and normal maps and fed into a ControlNet‑conditioned Stable Diffusion model. Features from the later denoising steps (which contain richer semantic content) are combined with DINOv2 visual descriptors, yielding a 2048‑dimensional per‑pixel vector. These vectors are unprojected onto mesh vertices, aggregated with a ball‑query, and averaged across views to produce a per‑vertex semantic descriptor S.
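The unprojection-and-aggregation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rendering, ControlNet/Stable Diffusion feature extraction, and exact ball-query radius are abstracted away, and the per-view features are assumed to already be lifted to 3D points (the function names and the 0.05 radius are hypothetical).

```python
import numpy as np

def ball_query_aggregate(vertices, feat_points, feat_vals, radius=0.05):
    """For each mesh vertex, average the features of unprojected pixel
    points that fall within `radius` of it (the ball query)."""
    V, D = len(vertices), feat_vals.shape[1]
    out = np.zeros((V, D))
    for i, v in enumerate(vertices):
        dist = np.linalg.norm(feat_points - v, axis=1)
        mask = dist < radius
        if mask.any():
            out[i] = feat_vals[mask].mean(axis=0)
    return out

def fuse_views(vertices, per_view_points, per_view_feats, radius=0.05):
    """Average per-view descriptors across all views in which a vertex
    received any features, yielding the per-vertex semantic descriptor S."""
    acc = np.zeros((len(vertices), per_view_feats[0].shape[1]))
    cnt = np.zeros(len(vertices))
    for pts, feats in zip(per_view_points, per_view_feats):
        f = ball_query_aggregate(vertices, pts, feats, radius)
        hit = np.abs(f).sum(axis=1) > 0
        acc[hit] += f[hit]
        cnt[hit] += 1
    cnt[cnt == 0] = 1  # vertices seen in no view keep a zero descriptor
    return acc / cnt[:, None]
```

Averaging only over views that actually hit a vertex avoids diluting the descriptor of surfaces that are occluded in most renders.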

The geometric stream samples 2048 points from the mesh, concatenates their 3‑D coordinates and normals, and encodes them with Point Transformer V3, a state‑of‑the‑art point‑cloud transformer. The resulting 64‑dimensional geometric features are projected to a 32‑dimensional latent space H_geo. The semantic descriptors are compressed to the same dimensionality H_sem using a two‑layer MLP.
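The two projection heads can be sketched with the dimensions stated above. Point Transformer V3 is replaced here by a random linear map purely to keep the sketch self-contained, and the MLP hidden width of 256 is an assumption not given in the summary; only the input/output sizes (6 → 64 → 32 geometric, 2048 → 32 semantic) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp2(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, compressing semantic descriptors."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

N = 2048                                  # points sampled from the mesh
points = rng.normal(size=(N, 6))          # xyz coordinates + surface normals

# Stand-in for Point Transformer V3: a random linear encoder, 6 -> 64 dims.
enc = rng.normal(size=(6, 64)) * 0.1
geo_feat = points @ enc                   # 64-dim geometric features
W_geo = rng.normal(size=(64, 32)) * 0.1
H_geo = geo_feat @ W_geo                  # 32-dim geometric latent

sem = rng.normal(size=(N, 2048))          # per-vertex diffusion + DINOv2 descriptor
w1, b1 = rng.normal(size=(2048, 256)) * 0.02, np.zeros(256)
w2, b2 = rng.normal(size=(256, 32)) * 0.1, np.zeros(32)
H_sem = mlp2(sem, w1, b1, w2, b2)         # 32-dim semantic latent
```

Projecting both streams to the same 32-dimensional space is what allows the attention block in the next step to compare them directly.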

Fusion is performed by letting H_geo act as the query (Q) and H_sem as key (K) and value (V) in a multi‑head attention block (4 heads, 8 dimensions each). This geometry‑to‑semantics attention embodies the cognitive hypothesis that low‑level geometric distinctiveness triggers the retrieval of high‑level semantic knowledge. Consequently, regions that are both geometrically salient and semantically important receive high attention scores, while semantically important but geometrically flat regions are down‑weighted unless they exhibit enough geometric contrast. The fused representation is passed through a sigmoid layer to obtain a per‑vertex saliency probability.
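The asymmetric fusion can be written out explicitly. This sketch assumes single scaled-dot-product attention per head with randomly initialized projection matrices and no output projection or residual connection; those details are not specified in the summary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attention(H_geo, H_sem, Wq, Wk, Wv, heads=4):
    """Asymmetric fusion: geometry supplies the queries, semantics
    supply the keys and values (4 heads of 8 dims each, as described)."""
    N, d = H_geo.shape
    dh = d // heads
    Q = (H_geo @ Wq).reshape(N, heads, dh)
    K = (H_sem @ Wk).reshape(N, heads, dh)
    V = (H_sem @ Wv).reshape(N, heads, dh)
    out = np.zeros_like(Q)
    for h in range(heads):
        A = softmax(Q[:, h] @ K[:, h].T / np.sqrt(dh))  # geometry queries semantics
        out[:, h] = A @ V[:, h]
    return out.reshape(N, d)

# Usage: saliency = sigmoid(cross_attention(H_geo, H_sem, Wq, Wk, Wv) @ w_out)
```

Because only H_geo forms the queries, a vertex with no geometric distinctiveness produces a weak query and retrieves little semantic content, which is exactly the down-weighting behavior described above.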

Beyond static saliency, the authors extend the framework to generate temporal scanpaths on the mesh surface. They formulate scanpath generation as a partially observable Markov decision process (POMDP) where each state is the current vertex, actions move to adjacent vertices respecting mesh connectivity, and the reward balances saliency exploitation with an inhibition‑of‑return (IOR) penalty that discourages revisiting already attended areas. The policy is trained with Proximal Policy Optimization (PPO), producing sequences of fixations that mimic human eye movements in virtual‑reality experiments.
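The reward structure of this POMDP can be illustrated with a minimal rollout on the mesh graph. The paper trains a stochastic PPO policy; the greedy action selection, decay rate, and penalty weight below are stand-ins chosen only to show how saliency exploitation trades off against the inhibition-of-return memory.

```python
import numpy as np

def scanpath(adjacency, saliency, steps=10, ior_decay=0.5, ior_penalty=1.0):
    """Greedy rollout of the saliency-vs-IOR reward on a mesh graph.
    `adjacency` maps each vertex to its neighbors, so every move
    respects mesh connectivity."""
    v = int(np.argmax(saliency))        # start at the most salient vertex
    ior = np.zeros_like(saliency)       # inhibition-of-return memory
    path = [v]
    for _ in range(steps - 1):
        ior *= ior_decay                # inhibition fades over time
        ior[v] += 1.0                   # suppress the current fixation
        nbrs = adjacency[v]
        rewards = [saliency[u] - ior_penalty * ior[u] for u in nbrs]
        v = nbrs[int(np.argmax(rewards))]
        path.append(v)
    return path
```

A PPO agent would instead sample actions from a learned policy over the same neighbor set and be updated from these rewards; the environment dynamics and reward are the parts the sketch preserves.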

Experiments on three public 3D eye‑tracking datasets—SAL3D, NUS3D, and 3DVA—show that SemGeo‑AttentionNet outperforms prior methods across multiple metrics (AUC‑Judd, NSS, CC, KL‑Div). Notably, the model excels at highlighting flat but semantically rich regions (e.g., text on a screen, faces) where geometry‑only baselines fail. Ablation studies confirm the importance of (1) frozen diffusion priors, (2) the asymmetric cross‑attention (symmetrical concatenation degrades performance), and (3) the compression of semantic features to a low‑dimensional space (preventing over‑fitting).
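Two of the reported metrics, NSS and CC, have compact definitions that apply directly to per-vertex maps; a sketch follows (the epsilon terms guarding against zero variance are my addition, and AUC-Judd and KL-Div are omitted for brevity).

```python
import numpy as np

def nss(sal, fix_mask):
    """Normalized Scanpath Saliency: mean z-scored predicted saliency
    at the ground-truth fixation locations."""
    z = (sal - sal.mean()) / (sal.std() + 1e-8)
    return z[fix_mask.astype(bool)].mean()

def cc(sal, gt):
    """Pearson linear correlation between predicted and ground-truth maps."""
    a = sal - sal.mean()
    b = gt - gt.mean()
    return (a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8)
```

Higher is better for both; KL-Div, by contrast, is a dissimilarity where lower is better, which is worth keeping in mind when reading the paper's tables.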

The paper’s contributions are threefold: (i) a novel pipeline that extracts zero‑shot semantic priors from diffusion models and maps them onto raw meshes without requiring textures; (ii) a cognitively motivated asymmetric attention module that formalizes bottom‑up versus top‑down interaction; and (iii) the first reinforcement‑learning‑based scanpath generator that respects 3D mesh topology and incorporates IOR dynamics. The authors suggest future work on real‑time deployment, multimodal extensions (e.g., haptic cues), and task‑oriented scanpath planning.

