Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object’s mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach that combines sparse point-cloud data derived from depth images with RGB images to estimate object mass. We evaluate a range of point-cloud processing architectures alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset from ShapeNetSem 3D models, simulating RGB-D images with a Kinect camera model. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The code for data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD), depth-estimator training (https://github.com/RavineWindteer/GLPDepth-Edited), and mass estimation (https://github.com/RavineWindteer/Depth-mass-estimator) is available online.


💡 Research Summary

This paper addresses the problem of estimating an object’s inertial mass using only visual sensors, a capability that can greatly improve robotic grasping, manipulation, and simulation without requiring prior physical contact. The authors observe that existing RGB‑only approaches suffer from inherent ambiguities in scale and density, and that large‑scale datasets containing both depth and mass information are virtually nonexistent. To overcome these limitations, they create a synthetic RGB‑D dataset based on 8,948 ShapeNetSem 3D models, rendering each object from 14 viewpoints with a simulated Kinect camera. Depth values are normalized by the object’s bounding‑box diagonal, yielding scale‑invariant depth maps that can later be converted back to metric units.
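The bounding-box normalization described above can be sketched in plain Python. The function names and data layout here are illustrative, not taken from the paper's code; the key idea is that dividing metric depth by the object's bounding-box diagonal makes the depth map scale-invariant, and multiplying by the diagonal recovers metric units:

```python
import math

def bbox_diagonal(bbox_min, bbox_max):
    """Diagonal length of an axis-aligned bounding box (metres)."""
    return math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(bbox_min, bbox_max)))

def normalize_depth(depth_m, diag_m):
    """Metric depth values -> dimensionless, scale-invariant depth."""
    return [d / diag_m for d in depth_m]

def denormalize_depth(depth_norm, diag_m):
    """Scale-invariant depth -> metric depth, given the bbox diagonal."""
    return [d * diag_m for d in depth_norm]

# Example: an object spanning a 3 x 4 m bounding-box face has diagonal 5 m.
diag = bbox_diagonal((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
normalized = normalize_depth([10.0, 2.5], diag)
```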

Using this synthetic data, they fine‑tune a GLPDepth model to predict dense depth maps. The trained depth estimator is then applied to the public image2mass dataset (≈150 k items), augmenting it with dense depth information and producing a large‑scale RGB‑D‑mass dataset suitable for deep learning.

The proposed mass‑estimation architecture follows an encoder‑decoder paradigm. RGB images are processed by a DenseNet‑121 encoder, while sparse point clouds derived from the depth maps are encoded by one of three alternatives: (1) PointNet, which treats each point independently with shared linear layers and a global max‑pool; (2) DGCNN, which builds dynamic k‑NN graphs to capture local geometry and employs residual connections; and (3) PointTransformer, which uses vector‑attention over local neighborhoods and progressive down‑sampling. The latent features from the two encoders feed two separate decoders: a density decoder and a volume decoder. The density decoder uses a custom activation function fitted to the empirical density distribution of the training set, guiding predictions toward realistic values. The volume decoder employs a ReLU to ensure positive volume estimates. Final mass is obtained by multiplying the two decoder outputs, after scaling them with a constant b≈16.5 to balance the magnitude of the density and volume terms.
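The two-decoder mass head can be summarized with a minimal numerical sketch. The scaling constant b≈16.5 is reported in the summary above; the sigmoid used here is only a placeholder for the paper's custom density activation (which is fitted to the empirical density distribution and not reproduced here), and the function names are hypothetical:

```python
import math

B = 16.5  # scaling constant balancing density and volume magnitudes (from the paper)

def predict_mass(density_logit: float, volume_logit: float) -> float:
    """Combine the two decoder outputs into a mass estimate.

    density_logit, volume_logit: raw scalar outputs of the density and
    volume decoders (placeholders for the network's final-layer values).
    """
    # Placeholder activation in (0, 1); the paper instead fits a custom
    # activation to the training set's density distribution.
    density = 1.0 / (1.0 + math.exp(-density_logit))
    # ReLU keeps the volume estimate non-negative, as in the paper.
    volume = max(0.0, volume_logit)
    # Mass = scaled product of predicted density and volume.
    return B * density * volume
```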

An optional FoldingNet decoder can be attached for point‑cloud reconstruction, trained with Chamfer Distance. However, experiments show that this auxiliary task does not significantly improve mass‑prediction accuracy, suggesting that the primary regression objective dominates learning.
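For reference, the Chamfer Distance used to train the optional reconstruction decoder is the standard symmetric point-set distance: each point in one set is matched to its nearest neighbor in the other, and the squared distances are averaged in both directions. A brute-force sketch (quadratic in the number of points, so only suitable for small clouds):

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    a, b: non-empty lists of 3-tuples (x, y, z).
    """
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    # Average nearest-neighbor squared distance in each direction.
    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a
```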

The authors evaluate the system with scale‑invariant metrics: Absolute Log Difference Error (ALDE), Absolute Percentage Error (APE), Minimum Ratio Error (MnRE), and a “q‑metric” measuring the percentage of predictions within a factor of two of the ground truth. Across all metrics, the multimodal model outperforms the RGB‑only baseline by a large margin. Notably, MnRE and q‑metric indicate that over 85 % of predictions lie within a factor of two, and ALDE/APE errors are reduced by more than 30 % relative to prior work.
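Assuming the standard definitions of these scale-invariant metrics (as introduced by the image2mass line of work), they can be written compactly as follows; note that MnRE lies in (0, 1] with 1 meaning a perfect prediction, and the q-metric is the fraction of predictions whose ratio error is within the given factor:

```python
import math

def alde(pred, true):
    """Absolute Log Difference Error: |ln(pred) - ln(true)|."""
    return abs(math.log(pred) - math.log(true))

def ape(pred, true):
    """Absolute Percentage Error: |pred - true| / true."""
    return abs(pred - true) / true

def mnre(pred, true):
    """Minimum Ratio Error: min(pred/true, true/pred), in (0, 1]."""
    return min(pred / true, true / pred)

def q_metric(preds, trues, factor=2.0):
    """Fraction of predictions within `factor` of the ground truth."""
    within = sum(1 for p, t in zip(preds, trues) if mnre(p, t) >= 1.0 / factor)
    return within / len(preds)
```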

Key contributions include: (1) a practical pipeline for generating dense depth maps from synthetic data and augmenting an existing large‑scale mass dataset; (2) a thorough comparison of three point‑cloud encoders, demonstrating that DGCNN yields the best trade‑off between accuracy and computational cost; (3) a novel decoder design that balances density and volume predictions via a scaling constant; and (4) a comprehensive evaluation showing that integrating depth information dramatically improves mass estimation in unconstrained, real‑world object categories.

The paper concludes by highlighting future directions such as real‑time depth estimation on robot platforms, extending density modeling to heterogeneous materials, and testing the approach in dynamic, multi‑view scenarios.

