Z3D: Zero-Shot 3D Visual Grounding from Images
3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods that cause significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes the full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at https://github.com/col14m/z3d.
💡 Research Summary
The paper introduces Z3D, a universal zero‑shot 3D visual grounding system that operates solely on multi‑view RGB images, optionally enriched with camera poses and depth maps. The authors identify two major shortcomings of existing zero‑shot approaches: (1) low‑quality object proposals that limit downstream reasoning, and (2) under‑utilization of modern vision‑language models (VLMs), which are often replaced by non‑generative language models such as BERT or CLIP. Z3D addresses these issues through (i) the integration of a state‑of‑the‑art zero‑shot 3D instance segmentation method, MaskClustering, to generate high‑fidelity class‑agnostic 3D bounding‑box proposals, and (ii) advanced prompt‑based segmentation using SAM3‑Agent, a VLM‑driven segmentation agent that iteratively refines masks based on language prompts.
The pipeline consists of four stages. First, a lightweight CLIP embedding step selects the six most semantically relevant views for a given query, dramatically reducing the number of images that must be processed by the expensive VLM; the VLM then picks the three best views among these candidates. Second, SAM3‑Agent performs zero‑shot object segmentation on the selected views, guided by prompts generated from the VLM's reasoning. Third, each 2D mask is lifted into 3D space using the available camera poses (or poses estimated by DUSt3R when absent) to obtain partial point clouds. Fourth, these lifted masks are matched against the 3D proposals from MaskClustering via a 3D IoU voting scheme; the proposal with the most votes becomes the final grounding result. When depth or pose information is missing, the system employs DUSt3R to reconstruct dense depth maps, infer camera poses, fuse them into a TSDF volume, and extract a point cloud via marching cubes, keeping the entire process training‑free.
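The first stage's CLIP‑based view pre‑selection reduces to ranking per‑view image embeddings by cosine similarity against the query's text embedding. A minimal sketch, assuming embeddings have already been computed and L2‑normalized (the function name and shapes here are illustrative, not from the paper's code):

```python
import numpy as np

def select_views(image_embs: np.ndarray, text_emb: np.ndarray, k: int = 6) -> list:
    """Return indices of the k views most semantically relevant to the query.

    image_embs: (N, D) array of L2-normalized CLIP image embeddings, one per view.
    text_emb:   (D,)   L2-normalized CLIP text embedding of the query.
    """
    sims = image_embs @ text_emb           # cosine similarity per view
    return np.argsort(-sims)[:k].tolist()  # indices of the top-k views
```

Only these k candidate views are then handed to the VLM, which makes the final choice of views; the cheap dot‑product filter is what keeps the expensive VLM calls bounded.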
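The fourth stage's matching step can be sketched as follows: each lifted per‑view mask (reduced here to an axis‑aligned 3D box for simplicity) casts a vote for the MaskClustering proposal it overlaps most, and the proposal with the most votes wins. The box representation, threshold, and function names are assumptions for illustration:

```python
import numpy as np

def aabb_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))   # intersection volume (0 if disjoint)
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return float(inter / (vol_a + vol_b - inter + 1e-9))

def vote_proposal(proposals: list, lifted_boxes: list, thr: float = 0.25) -> int:
    """Each lifted box votes for its best-overlapping proposal; return the winner's index."""
    votes = np.zeros(len(proposals), dtype=int)
    for lb in lifted_boxes:
        ious = np.array([aabb_iou(p, lb) for p in proposals])
        if ious.max() >= thr:                      # ignore lifted boxes that match nothing
            votes[ious.argmax()] += 1
    return int(votes.argmax())
```

Aggregating votes across several views makes the final grounding robust to a single bad mask or a noisy depth lift, which is presumably why the pipeline selects multiple views rather than relying on one.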
Extensive experiments on ScanRefer (51K query‑object pairs) and Nr3D (41K queries) demonstrate that Z3D outperforms all prior zero‑shot methods by a large margin. On ScanRefer, Z3D achieves an absolute gain of +38.7% in Acc@0.5 over the previous best zero‑shot baseline (OpenScene). On Nr3D, it sets a new state‑of‑the‑art top‑1 accuracy, confirming that both the high‑quality proposals and the VLM‑driven reasoning contribute to the improvement. Ablation studies systematically evaluate each component: MaskClustering alone provides a modest baseline; adding CLIP‑based view pre‑selection and SAM3‑Agent raises performance substantially; incorporating VLM‑based view selection and multi‑view aggregation yields the best results.
The authors acknowledge limitations: reliance on CLIP for the initial view filter may miss subtle concepts; image‑only performance hinges on the quality of DUSt3R reconstructions, which may vary across datasets; and MaskClustering remains the most computationally intensive module, posing a bottleneck for real‑time applications. Nonetheless, Z3D represents a significant step toward practical, supervision‑free 3D scene understanding, demonstrating that multi‑view image grounding can be achieved without any annotated 3D data. The work opens avenues for further research on more efficient proposal generators, alternative view‑selection mechanisms, and broader generalization to outdoor or unstructured environments.