Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment, and estimate the 6-DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views of an object to generate, via a learned unsigned distance field (UDF), a sparse object-like point cloud carrying semantic and geometric information. A decoder then takes this object encoding together with a query image to produce a variety of dense predictions. Through extensive experiments, we show that our method performs few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information into a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo


💡 Research Summary

The paper introduces Neural Memory Object (NeMO), a geometry‑aware, object‑centric representation that enables few‑shot perception of previously unseen objects using only a handful of RGB template views. Unlike prior few‑shot methods that rely on pairwise feature matching between each template and the query image, NeMO aggregates all template information into a single, structured memory that can be queried by a universal decoder.

The system consists of an encoder and a decoder. The encoder first processes each template image with a Vision Transformer (ViT) to obtain patch‑wise visual features. One of the templates is randomly selected as an “anchor” to provide a reference orientation. A multi‑view transformer encoder then refines the features through cross‑ and self‑attention, encouraging interaction between the anchor and the remaining views.
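The refinement step described above can be sketched in a few lines of PyTorch. All dimensions here (number of views, patches per view, feature width) and the exact attention wiring are assumptions for illustration, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

# Minimal sketch of the multi-view refinement, with made-up sizes:
# V template views, P ViT patches per view, feature dimension D.
V, P, D = 4, 196, 256

class MultiViewEncoder(nn.Module):
    """Refines per-view patch features via self- and cross-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, anchor_feats, view_feats):
        # Self-attention over the pooled tokens of all views.
        x = torch.cat([anchor_feats, view_feats], dim=1)
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        # Cross-attention: every token attends to the anchor view,
        # tying the remaining views to the reference orientation.
        x = self.norm2(x + self.cross_attn(x, anchor_feats, anchor_feats)[0])
        return x

feats = torch.randn(1, V * P, D)   # stand-in for ViT patch features
anchor = feats[:, :P]              # one view picked as the anchor
refined = MultiViewEncoder(D)(anchor, feats[:, P:])
print(refined.shape)               # torch.Size([1, 784, 256])
```

In practice the paper stacks several such blocks; a single layer is shown here only to make the query/key/value roles explicit.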

To embed geometry, a set of 3D points Q is uniformly sampled in a normalized cube. Each point is passed through a small MLP (λ) to generate an initial feature vector. The point features act as queries while the refined image features serve as keys and values in a series of transformer decoder blocks – the “Geometric Mapping” module. This module fuses 2D visual cues with 3D spatial information, allowing each point to attend to the most relevant visual evidence across all views.
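The Geometric Mapping step can be illustrated with a standard transformer decoder layer, where point features are the target (query) sequence and image features are the memory. The sizes and the use of a single `nn.TransformerDecoderLayer` are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the "Geometric Mapping" module, with made-up sizes:
# N sampled points, feature dimension D, T image tokens from the encoder.
N, D, T = 512, 256, 784

point_mlp = nn.Sequential(          # stands in for the small MLP (lambda)
    nn.Linear(3, D), nn.ReLU(), nn.Linear(D, D))
decoder_block = nn.TransformerDecoderLayer(
    d_model=D, nhead=8, batch_first=True)

q = torch.rand(1, N, 3) * 2 - 1     # uniform samples in the cube [-1, 1]^3
point_feats = point_mlp(q)          # initial per-point feature vectors
image_feats = torch.randn(1, T, D)  # refined multi-view image features

# Point features act as queries; image features serve as keys/values,
# so each 3D point attends to the relevant visual evidence.
fused = decoder_block(tgt=point_feats, memory=image_feats)
print(fused.shape)                  # torch.Size([1, 512, 256])
```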

Simultaneously, an Unsigned Distance Field (UDF) is learned via another MLP that predicts the unsigned distance d_i from each sampled point q_i to the nearest surface point of the object. Differentiating the UDF yields a direction vector v_i, from which the surface point s_i = q_i − d_i·v_i is computed. The final NeMO representation χ consists of pairs (s_i, f̄_i), where s_i are surface points in the anchor coordinate system and f̄_i are the geometry-enhanced point features. Because the surface points and their features are stored explicitly, NeMO can be transformed (rotated, translated, scaled) after creation, enabling flexible downstream processing.
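The projection s_i = q_i − d_i·v_i can be demonstrated with a toy analytic UDF (distance to the unit sphere) in place of the learned MLP; the direction v_i is obtained by differentiating the UDF with autograd:

```python
import torch

# Toy UDF: unsigned distance from points q to the unit sphere.
# A learned MLP would take this role in the actual method.
def udf(q: torch.Tensor) -> torch.Tensor:
    return (q.norm(dim=-1) - 1.0).abs()

# Sample points outside the sphere, tracked for differentiation.
q = (torch.rand(64, 3) * 0.5 + 1.2).requires_grad_()
d = udf(q)

# v_i = normalized gradient of the UDF at q_i (direction of increasing
# distance, i.e. pointing away from the surface).
(v,) = torch.autograd.grad(d.sum(), q)
v = v / v.norm(dim=-1, keepdim=True)

# Project each sample onto the surface: s_i = q_i - d_i * v_i
s = q - d.unsqueeze(-1) * v
print(udf(s).max())   # residual distance is ~0 for this analytic UDF
```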

The decoder receives a query image, extracts its features with a ViT, and upsamples them using a Dense Prediction Transformer (DPT) to obtain dense feature maps. Cross- and self-attention layers then combine these query features with the NeMO memory. The decoder produces four dense outputs: (1) a modal segmentation mask, (2) an amodal segmentation mask, (3) a dense 2D-3D correspondence map X (a pixel-to-surface-point mapping), and (4) a confidence map C indicating the reliability of each correspondence. After filtering X with C, a RANSAC-based PnP solver estimates the 6-DoF pose of the object in the query image.
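The final filtering-and-PnP step can be sketched with NumPy on synthetic maps; the map shapes and the confidence threshold `tau` are illustrative assumptions:

```python
import numpy as np

# Synthetic stand-ins for the decoder outputs:
# X maps each pixel to a 3D surface point, C scores its reliability.
H, W = 60, 80
X = np.random.rand(H, W, 3)      # dense 2D-3D correspondence map
C = np.random.rand(H, W)         # per-pixel confidence map

# Keep only correspondences above a confidence threshold.
tau = 0.5
ys, xs = np.nonzero(C > tau)
pts_2d = np.stack([xs, ys], axis=1).astype(np.float64)  # pixel coordinates
pts_3d = X[ys, xs].astype(np.float64)                   # surface points

print(pts_2d.shape, pts_3d.shape)

# The surviving 2D-3D pairs would then go to a RANSAC PnP solver, e.g.
# (K being the camera intrinsic matrix of the query image):
# ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
```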

Training is performed on a large synthetic RGB-D dataset where ground-truth object poses, masks, and camera intrinsics are known. The losses include: a surface-point regression loss L_χ that aligns predicted surface points with the ground truth; Dice plus binary cross-entropy losses for the modal and amodal masks; a confidence-weighted L1 loss for the 2D-3D correspondence map; and auxiliary certainty/uncertainty losses that shape the confidence map. To enforce invariance to the anchor frame, the NeMO points are randomly transformed (rotated, translated, scaled) before being fed to the decoder during training.
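Two of these losses can be sketched directly. The Dice + BCE combination is standard; the −log(conf) regularizer and its 0.1 weight are stand-ins for the paper's certainty/uncertainty terms, not its actual formulation:

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, eps: float = 1e-6):
    """Dice + binary cross-entropy, as used for the modal/amodal masks."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return dice + bce

def confidence_weighted_l1(x_pred, x_gt, conf):
    """L1 on the 2D-3D map, down-weighted where confidence is low.
    The -log(conf) term (weight 0.1, chosen arbitrarily here) keeps
    the network from driving all confidences to zero."""
    l1 = (x_pred - x_gt).abs().mean(dim=-1)
    return (conf * l1 - 0.1 * torch.log(conf)).mean()

logits = torch.randn(1, 64, 64)
target = (torch.rand(1, 64, 64) > 0.5).float()
loss = dice_bce_loss(logits, target)
print(loss.item())
```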

Extensive experiments on the BOP benchmark (including LM‑O, YCB‑V, T‑LESS, and others) demonstrate that NeMO achieves state‑of‑the‑art performance on both model‑free (template‑only) and model‑based (CAD‑rendered) few‑shot tasks. Notably, with as few as 2–5 template images, the method attains high AP and ADD‑S scores while inference time remains essentially constant regardless of the number of templates, making it suitable for real‑time robotic applications.

Key contributions are: (1) the design of a compact, geometry‑sensitive point‑cloud memory that stores object‑specific information outside the network weights, (2) a unified encoder‑decoder architecture that can perform detection, segmentation, and pose estimation for any novel object without retraining, and (3) a synthetic, object‑centric dataset that balances class distribution and clutter to facilitate robust few‑shot learning.

Future work may explore online updating of NeMO from video streams, integration of texture or material cues, dynamic point sampling strategies, and compression techniques to further reduce memory footprint while preserving geometric fidelity.

