Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery
Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. Existing methods rely primarily on RGB features and temporal smoothing, and therefore struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: (i) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (ii) a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (iii) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in spatial accuracy and in robustness to heavy occlusion while maintaining computational efficiency.
💡 Research Summary
The paper tackles the long‑standing problem of metric ambiguity and temporal instability in monocular video‑based human mesh recovery. While prior works such as VIBE, TCMR, and GLoT rely solely on RGB cues and temporal smoothing, they cannot resolve depth ordering, scale drift, or occlusion‑induced jitter because a single 2D observation admits infinitely many 3D configurations. The authors propose a three‑component framework that explicitly injects depth information into the reconstruction pipeline, thereby providing metric awareness and stronger geometric constraints.
- Depth‑Guided Multi‑Scale Fusion (DMFS) – RGB features are extracted with a ResNet‑50 backbone, while depth features are obtained from a pretrained Depth Anything v2 model. Instead of using raw depth values, intermediate activations are refined by lightweight convolutions and up‑sampling, then transformed into a per‑pixel modulation mask via two 1×1 convolutions and a sigmoid. This mask multiplicatively gates the RGB stream, and a channel‑wise gating mechanism (learned through adaptive pooling and an MLP) balances the contributions of RGB and depth modalities based on their confidence. The fused representation ˜Fₜ is obtained by concatenating the gated streams and passing them through a projection head. This design mitigates the impact of noisy depth estimates by allowing the network to down‑weight depth when its confidence is low (a minimal sketch of this gating step appears after this list).
- Depth‑guided Metric‑Aware Pose‑Shape (D‑MAPS) – Using ˜Fₜ, the system predicts swing components derived from normalized bone directions and twist components from local depth patches. A lightweight self‑attention aggregates these across time to produce an initial pose p_init. Simultaneously, bone lengths are estimated as depth‑confidence‑weighted temporal averages; these are blended with pre‑computed template statistics via a learned depth gate α to obtain calibrated bone lengths B_Z. The SMPL template is scaled accordingly, and a small MLP regresses the shape parameters s_init. By grounding the initialization in depth‑derived absolute distances, D‑MAPS eliminates the scale drift that typically appears during later temporal smoothing (see the bone‑length calibration sketch after this list).
- Motion‑Depth Aligned Refinement (MoDAR) – The initial pose and shape are refined through a cross‑modal attention mechanism. Motion tokens, generated by lifting 2D joint detections to pelvis‑centered 3D joints with DSTformer, serve as queries, while the fused features ˜Fₜ act as keys and values. Two stacked cross‑attention blocks enable bidirectional information flow between motion dynamics and depth‑enhanced geometry. The resulting context feature F′ₜ is processed by a compact feed‑forward network and a residual head that predicts updates Δx. A causal temporal filter xₜ = (1‑ρ)xₜ₋₁ + ρ(x₀ + gₜ⊙Δx) with gate gₜ = σ(W F′ₜ) suppresses high‑frequency oscillations while preserving rapid motions (see the temporal‑filter sketch after this list).
Training proceeds in two phases. Phase 1 warms up the backbone and the RGB‑depth fusion under motion supervision, using a gated depth integration to ensure stability. Phase 2 jointly optimizes D‑MAPS, MoDAR, and the SMPL regressor with a multi‑task loss comprising mesh, joint, pose, shape, and temporal smoothness terms. No explicit depth‑specific loss is required; depth is used purely as a feature cue.
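A hedged sketch of the Phase-2 multi-task objective is given below: a weighted sum of mesh, joint, pose, shape, and temporal-smoothness terms, with no depth-specific loss. The dictionary keys, loss choices (L1 vs. MSE), and weights are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def total_loss(pred: dict, gt: dict, weights: dict) -> torch.Tensor:
    """Sketch of a multi-task objective over mesh vertices, 3D joints,
    SMPL pose/shape parameters, and temporal smoothness of the joints."""
    l_mesh  = F.l1_loss(pred["verts"],  gt["verts"])
    l_joint = F.l1_loss(pred["joints"], gt["joints"])
    l_pose  = F.mse_loss(pred["pose"],  gt["pose"])
    l_shape = F.mse_loss(pred["shape"], gt["shape"])
    # Temporal smoothness: penalize frame-to-frame acceleration of predicted joints
    j = pred["joints"]                                  # (T, J, 3)
    accel = j[2:] - 2.0 * j[1:-1] + j[:-2]
    l_temp = accel.norm(dim=-1).mean()
    return (weights["mesh"] * l_mesh + weights["joint"] * l_joint +
            weights["pose"] * l_pose + weights["shape"] * l_shape +
            weights["temp"] * l_temp)
```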
The method is evaluated on three standard benchmarks: 3DPW, Human3.6M (Protocol 2), and MPI‑INF‑3DHP. On 3DPW the proposed system achieves MPJPE 69.31 mm, PA‑MPJPE 46.68 mm, MPVPE 82.61 mm, and acceleration error 7.14 mm/s², surpassing the previous best (AR‑TS) by 2–3 mm on most metrics. Similar gains are observed on Human3.6M and MPI‑INF‑3DHP, with particularly notable improvements in acceleration error, indicating better temporal stability without over‑smoothing.
Ablation studies confirm the contribution of each component. Adding mask‑guided fusion to an RGB‑only baseline reduces MPJPE from 82.45 mm to 73.12 mm, and incorporating quality‑aware depth further lowers it to 71.05 mm. D‑MAPS alone yields 72.20 mm, while MoDAR alone yields 71.36 mm. The full system, combining all three, reaches the best result of 69.48 mm. Longer input sequences consistently reduce both spatial error and acceleration error, demonstrating the benefit of richer temporal context.
The paper’s strengths lie in (i) a principled integration of depth cues that directly addresses metric ambiguity, (ii) a lightweight yet effective gating strategy that makes the system robust to noisy depth, (iii) a clear separation of metric‑aware initialization and temporal refinement, and (iv) thorough empirical validation across multiple datasets. Potential limitations include reliance on the quality of the pretrained depth estimator (failure under extreme lighting or reflective surfaces could degrade performance) and confinement to the SMPL model, which may struggle with non‑standard clothing or multi‑person scenes. Future work could explore self‑supervised depth refinement, extensions to multi‑person settings, and richer shape priors for clothed humans. Overall, the paper presents a compelling and well‑engineered solution that advances the state of the art in monocular video human mesh recovery.