MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds
Food portion estimation is crucial for monitoring health and tracking dietary intake. Image-based dietary assessment, which analyzes eating occasion images with computer vision techniques, is increasingly replacing traditional methods such as 24-hour recalls. However, accurately estimating nutritional content from images remains challenging because 3D information is lost when a scene is projected onto the 2D image plane. Existing portion estimation methods are difficult to deploy in real-world scenarios because they rely on specific requirements such as physical reference objects, high-quality depth information, or multi-view images and videos. In this paper, we introduce MFP3D, a new framework for accurate food portion estimation using only a single monocular image. Specifically, MFP3D consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and concatenates features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content from the extracted features. MFP3D is evaluated on the MetaFood3D dataset, demonstrating significant improvement in portion estimation accuracy over existing methods.
💡 Research Summary
MFP3D introduces a novel end‑to‑end framework for estimating food portion size and energy content using only a single monocular RGB image. The system consists of three sequential modules. First, a 3D reconstruction module converts the input image into a 3D point cloud. Depth is estimated with ZoeDepth, a state‑of‑the‑art monocular depth predictor, and combined with the original pixel coordinates to obtain a set of 3‑D points. As an alternative, the authors also employ TripoSR to reconstruct a mesh from the image and then sample points from that mesh, demonstrating that the pipeline is agnostic to the specific reconstruction technique.
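The depth-to-point-cloud step described above can be sketched as standard pinhole back-projection. This is a minimal illustration, not the authors' code: the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumed values, and a real pipeline would use the estimated depth map from ZoeDepth rather than a synthetic one.

```python
import numpy as np

def backproject_to_point_cloud(depth, fx, fy, cx, cy):
    """Lift a dense depth map (H, W) to an (H*W, 3) point cloud using the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a flat surface 1 m from the camera with made-up intrinsics
depth = np.ones((4, 4))
pts = backproject_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(pts.shape)  # (16, 3)
```

In practice the resulting cloud is downsampled to a fixed number of points (the summary later mentions 1024) before being fed to the point-cloud backbone.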
Second, a feature extraction module processes the two modalities independently. For the 2‑D image, a ResNet‑50 backbone (pre‑trained on ImageNet) with the final two layers removed is followed by a fully‑connected layer that outputs a 512‑dimensional vector (f_I). For the 3‑D point cloud, CurveNet is used as the backbone; its Local Point Feature Aggregation (LPFA) and Curve Intervention Convolution (CIC) blocks capture fine‑grained geometric detail and multi‑scale context, also producing a 512‑dimensional vector (f_P). The two vectors are concatenated to form a 1024‑dimensional fused representation (f).
Third, a regression module maps the fused feature vector to scalar estimates of food volume and caloric energy. A simple linear layer (or a deeper regression network) is trained with an L1 loss, encouraging accurate absolute predictions.
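The fusion and regression steps reduce to a concatenation followed by a linear map trained with an L1 objective. The sketch below uses random numpy arrays in place of the real ResNet-50 and CurveNet outputs, and the target values are invented for illustration; only the shapes (512 + 512 → 1024 → 2) and the loss follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the backbone outputs (real system: ResNet-50 and CurveNet)
f_I = rng.standard_normal(512)   # 2-D image feature
f_P = rng.standard_normal(512)   # 3-D point-cloud feature

f = np.concatenate([f_I, f_P])   # fused 1024-dim representation

# Linear regression head with two outputs: volume and energy
W = rng.standard_normal((2, 1024)) * 0.01
b = np.zeros(2)
pred = W @ f + b                 # [volume_hat, energy_hat]

# L1 loss against a hypothetical ground truth (e.g. mL and kcal)
target = np.array([150.0, 220.0])
l1_loss = np.abs(pred - target).mean()
print(f.shape, pred.shape)  # (1024,) (2,)
```

An L1 loss penalizes absolute error directly, which matches the MAE metric used for evaluation and is less sensitive to outlier portions than a squared loss.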
The authors evaluate MFP3D on the MetaFood3D dataset, which contains 637 food items across 108 categories, complete with 3‑D meshes, point clouds, RGB images, segmentation masks, and nutritional information. They split the data into 510 training and 127 test items, and also report results on the SimpleFood45 benchmark for broader validation. Two evaluation metrics are used: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
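The two reported metrics are standard and easy to state precisely; the values below are illustrative only, not results from the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more heavily."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up volume estimates for three food items
y_true = np.array([100.0, 250.0, 80.0])
y_pred = np.array([110.0, 240.0, 90.0])
print(mae(y_true, y_pred))   # 10.0
print(rmse(y_true, y_pred))  # 10.0
```

When all errors have the same magnitude the two metrics coincide; RMSE exceeds MAE once errors vary, so the gap between them indicates how unevenly a model's errors are distributed.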
Experimental results show that MFP3D substantially outperforms prior approaches that rely on physical reference objects, depth sensors, or multi‑view video. Using the 2‑D image alone yields higher MAE; adding the reconstructed point‑cloud features reduces MAE by roughly 10 % and improves volume estimation by 15–20 % compared to depth‑only or multi‑view baselines. The authors also conduct ablation studies comparing ground‑truth point clouds (GTPC), normalized GTPCs (shape only, no scale), depth‑derived point clouds, and TripoSR‑derived point clouds. All reconstructed point clouds yield competitive performance, confirming that the framework can operate without specialized hardware.
Key insights include: (1) shape information alone, when encoded in a point cloud, provides a strong cue for portion estimation; (2) fusing 2‑D visual cues (color, texture, edges) with 3‑D geometric cues yields a richer representation than either modality alone; (3) the pipeline can be trained end‑to‑end, allowing the depth estimator and feature extractors to adapt jointly to the downstream regression task.
Limitations are acknowledged: depth estimation errors propagate to the point cloud, especially for foods with complex geometry, translucency, or occlusions; the fixed 1024‑point sampling may discard fine details; and the linear regression head may not capture highly non‑linear relationships between shape and caloric density for certain food types.
Future work suggested includes: (a) integrating attention‑based fusion mechanisms to weight 2‑D versus 3‑D features adaptively; (b) exploring adaptive point‑sampling or hierarchical point‑cloud representations to retain more detail; (c) extending the model to predict additional nutritional attributes (macronutrients, micronutrients); and (d) optimizing the entire pipeline for real‑time inference on mobile devices, enabling on‑device dietary monitoring without external sensors.
In summary, MFP3D demonstrates that accurate food portion estimation is feasible with a single RGB photograph by reconstructing a 3‑D point cloud, extracting complementary 2‑D and 3‑D features, and regressing to volume and energy values. This approach removes the need for reference objects, depth cameras, or multi‑view setups, paving the way for scalable, user‑friendly dietary assessment tools.