MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes
Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. To address these challenges, we propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB features and geometric priors to enhance reconstruction accuracy. The geometric priors are generated by sampling and clustering ground-truth object data, producing class-level features that are dynamically adjusted during training to improve geometric understanding. Additionally, we introduce a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limitations of traditional linear decoders in processing complex multimodal inputs. Extensive experiments on the Pix3D dataset demonstrate that MGP-KAD achieves state-of-the-art (SOTA) performance, significantly improving geometric integrity, smoothness, and detail preservation. Our work provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes.
💡 Research Summary
MGP‑KAD addresses the long‑standing challenge of reconstructing high‑quality 3D geometry from a single RGB image in cluttered, real‑world scenes. The authors propose a four‑stage pipeline: (1) construction of a class‑level geometric prior library, (2) multimodal feature extraction and dynamic prior retrieval, (3) fusion of visual and geometric cues via multi‑head attention, and (4) a hybrid implicit decoder that combines a conventional linear front‑end with a multi‑scale Kolmogorov‑Arnold Network (KAN) decoder.
The geometric prior library is built by selecting a representative prototype for each object category. Prototypes are chosen as the instance whose surface point cloud is closest to the category‑wise mean distribution. After dimensionality reduction, the prototypes form well‑separated clusters, confirming their discriminative power. To mitigate class imbalance, the authors introduce a dynamic weight allocation that allows under‑represented categories to borrow geometric information from richer categories. Each prototype is encoded into a high‑dimensional feature vector by processing its coordinates and signed distance function (SDF) values through dedicated MLPs and then fusing them.
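The prototype-selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the mean pairwise Chamfer distance to the other instances in a category as a proxy for "closest to the category-wise mean distribution", and the function names (`chamfer`, `select_prototype`) are our own.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def select_prototype(clouds):
    """Pick the category prototype: the instance whose mean Chamfer distance
    to all other instances in the category is smallest (a proxy for being
    closest to the category-wise mean distribution)."""
    n = len(clouds)
    scores = [np.mean([chamfer(clouds[i], clouds[j])
                       for j in range(n) if j != i]) for i in range(n)]
    return int(np.argmin(scores))
```

In the full pipeline, the selected prototype's coordinates and SDF values would then be passed through the dedicated MLPs described above.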
For feature fusion, a pretrained M3D encoder extracts a 256‑dimensional semantic feature vector from the input image. Simultaneously, the geometric prior library provides a 256‑dimensional geometry feature via a multi‑head attention mechanism. Queries are derived from the image features, while keys and values come from the geometric priors. A 9‑dimensional one‑hot category vector is also fed into the attention block, ensuring explicit class awareness. The attention module uses eight parallel heads, which distributes sensitivity across sub‑spaces, improves robustness to noise, and preserves fine‑grained geometric details.
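The fusion step described above can be sketched in plain numpy. This is a simplified, hypothetical version (the projection matrices, the handling of the one-hot category vector, and any layer normalization are omitted or assumed): queries come from the 256-dimensional image feature, keys and values from the prior library, with eight heads over 32-dimensional sub-spaces.

```python
import numpy as np

def multihead_cross_attention(img_feat, priors, Wq, Wk, Wv, Wo, heads=8):
    """Cross-attention fusion: queries from the image feature (1, d),
    keys/values from the geometric priors (n, d). Wq/Wk/Wv/Wo are (d, d)
    projections; d must be divisible by `heads`."""
    d = img_feat.shape[-1]
    hd = d // heads                          # per-head dimension (256/8 = 32)

    def split(x):                            # (t, d) -> (heads, t, hd)
        return x.reshape(x.shape[0], heads, hd).transpose(1, 0, 2)

    q, k, v = split(img_feat @ Wq), split(priors @ Wk), split(priors @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)     # (heads, 1, n)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    out = (attn @ v).transpose(1, 0, 2).reshape(1, d)   # concatenate heads
    return out @ Wo                                      # fused geometry feature
```

Splitting the 256 dimensions across eight heads is what lets each head attend to a different sub-space of the prior features, which the paper credits for noise robustness and detail preservation.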
The decoder consists of two parts. The front‑end transformation network interleaves linear layers with Softplus activations and applies a carefully designed weight initialization (small random weights, negative biases, larger positional weights) to stabilize training and bias the network toward SDF‑like outputs. The transformed latent features are then processed by a multi‑scale KAN decoder. Each KANLinear layer combines a conventional linear transformation with a spline‑based nonlinear component, where B‑spline basis functions are evaluated on a learnable grid. The authors further propose a dynamic grid adaptation algorithm that blends a uniform grid with an adaptive grid derived from input quantiles, allowing the spline resolution to focus on regions with high geometric complexity. The KAN hierarchy reduces dimensions from 128 to 1 (the final SDF value) through successive layers (128→32→16→8→1), capturing global shape while progressively refining local details.
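The dynamic grid adaptation can be illustrated with a short sketch. This is our own minimal reading of the idea, not the paper's code: a uniform knot grid is linearly blended with an adaptive grid placed at the input quantiles, so spline resolution concentrates where the inputs (and hence geometric complexity) concentrate. The function name and `mix` parameter are assumptions.

```python
import numpy as np

def blended_grid(samples, grid_size=8, lo=-1.0, hi=1.0, mix=0.5):
    """Blend a uniform B-spline knot grid with an adaptive grid derived
    from the empirical quantiles of `samples`. mix=0 gives a purely
    uniform grid; mix=1 a purely data-adaptive one."""
    uniform = np.linspace(lo, hi, grid_size + 1)          # evenly spaced knots
    qs = np.linspace(0.0, 1.0, grid_size + 1)
    adaptive = np.quantile(samples, qs)                   # knots follow the data
    return (1.0 - mix) * uniform + mix * adaptive
```

Because quantiles are nondecreasing, the blended grid remains a valid (monotone) knot sequence for the B-spline basis used inside each KANLinear layer.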
During training, a differentiable volumetric rendering branch (inspired by M3D) is attached to enforce photometric consistency and geometric regularization (depth and normal reprojection losses). This branch is removed at inference time, where the final SDF field is converted to a mesh using Marching Cubes, ensuring fast, geometry‑only prediction.
Experiments on the Pix3D dataset (12,471 image‑model pairs across nine categories) demonstrate that MGP‑KAD outperforms all listed baselines, including SSR, MGN, LIEN, and InstPIFu. Compared to the strongest baseline SSR, MGP‑KAD reduces Chamfer Distance by 9.86 %, increases F‑Score by 6.03 %, and improves Normal Consistency by 12.2 %. Ablation studies confirm that both the KAN decoder and the geometric prior are essential: removing the KAN module inflates CD by 34.9 % and drops F‑Score by 4.27 %; removing the geometric prior also degrades performance across all metrics. The combination of a class‑aware geometric prior and a spline‑based nonlinear decoder enables the network to capture high‑frequency surface details that traditional linear decoders miss.
In summary, the paper contributes (i) a novel class‑level geometric prior modeling framework that dynamically adapts to intra‑class variation and dataset imbalance, (ii) a KAN‑based hybrid decoder that efficiently fuses multimodal inputs and provides powerful nonlinear approximation capabilities, and (iii) state‑of‑the‑art results on a challenging real‑world reconstruction benchmark. The work opens avenues for future research on reducing prior construction overhead (e.g., meta‑learning or online updating) and designing lightweight KAN variants for real‑time applications.