Benchmarking 3D Human Pose Estimation Models under Occlusions

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Human Pose Estimation (HPE) involves detecting and localizing keypoints on the human body from visual data. In 3D HPE, occlusions, where parts of the body are not visible in the image, pose a significant challenge for accurate pose reconstruction. This paper presents a benchmark of the robustness of 3D HPE models under realistic occlusion conditions, involving combinations of occluded keypoints commonly observed in real-world scenarios. We evaluate nine state-of-the-art 2D-to-3D HPE models, spanning convolutional, transformer-based, graph-based, and diffusion-based architectures, using BlendMimic3D, a synthetic dataset with ground-truth 2D/3D annotations and occlusion labels. All models were originally trained on Human3.6M and are tested here without retraining to assess their generalization. We introduce a protocol that simulates occlusion by adding noise to 2D keypoints based on real detector behavior, and conduct both global and per-joint sensitivity analyses. Our findings reveal that all models exhibit notable performance degradation under occlusion, with diffusion-based models underperforming despite their stochastic nature. Additionally, a per-joint occlusion analysis identifies consistent vulnerability in distal joints (e.g., wrists, feet) across models. Overall, this work highlights critical limitations of current 3D HPE models in handling occlusions and provides insights for improving real-world robustness.


💡 Research Summary

This paper addresses a critical gap in the evaluation of 3D human pose estimation (3D HPE) systems: their robustness to realistic occlusions. While most recent advances in 3D HPE have been benchmarked on clean, controlled datasets such as Human3.6M, real‑world applications (sports, retail, clinical gait analysis) regularly encounter self‑occlusion and external occlusion, which dramatically degrade performance. To quantify this problem, the authors construct a comprehensive benchmark using the synthetic BlendMimic3D dataset, which uniquely provides frame‑level and joint‑level occlusion annotations.

The benchmark protocol first analyses the error distribution of state‑of‑the‑art 2D keypoint detectors (e.g., OpenPose, HRNet) on real images. Using these statistics, the authors simulate occlusion by injecting joint‑specific Gaussian noise into the 2D keypoints, respecting the observed variance for each joint. Occlusion severity is varied from 0 % to 50 % of joints per frame, creating a controlled spectrum of noisy inputs that closely mimics the behavior of actual detectors under occlusion.
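The protocol described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the per-joint standard deviations, and the exact sampling scheme are assumptions for the example.

```python
import numpy as np

def occlude_keypoints(kpts_2d, joint_sigmas, occlusion_rate, rng=None):
    """Perturb a fraction of 2D keypoints per frame with joint-specific
    Gaussian noise, mimicking detector behavior under occlusion.

    kpts_2d        : (frames, joints, 2) array of 2D keypoints
    joint_sigmas   : (joints,) noise std per joint, estimated from real
                     detector errors on occluded joints
    occlusion_rate : fraction of joints to perturb in each frame (0.0-0.5)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = kpts_2d.copy()
    n_frames, n_joints, _ = kpts_2d.shape
    n_occluded = int(round(occlusion_rate * n_joints))
    for f in range(n_frames):
        # choose which joints are treated as occluded in this frame
        occluded = rng.choice(n_joints, size=n_occluded, replace=False)
        # joint-specific noise, same std for x and y of a given joint
        noise = rng.normal(0.0, joint_sigmas[occluded, None],
                           size=(n_occluded, 2))
        noisy[f, occluded] += noise
    return noisy
```

Sweeping `occlusion_rate` from 0.0 to 0.5 reproduces the controlled severity spectrum described in the paper.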

Nine pre‑trained 2D‑to‑3D lifting models are evaluated without any fine‑tuning:

  • Convolutional – VideoPose3D (temporal dilated convolutions)
  • Graph‑based – Dynamic Graph ConvNet (spatio‑temporal graph reasoning)
  • Transformer‑based – PoseFormer, MixSTE, PoseFormer‑V2 (global attention across space and time)
  • Diffusion‑based – D3DP, DiffuPose, FinePose (iterative denoising of multiple 3D hypotheses)

All models were originally trained on Human3.6M, ensuring that the benchmark measures pure generalisation to unseen occlusion patterns. Performance is reported using mean per‑joint position error (MPJPE) and its Procrustes‑aligned counterpart (PA‑MPJPE), both globally and per‑joint.
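MPJPE and PA-MPJPE are the standard 3D HPE metrics. A minimal single-pose sketch, using the classical Kabsch/Procrustes similarity alignment rather than the authors' evaluation code, might look like:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint (same units as the input)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after aligning one pose (J, 3) to the ground truth via a
    similarity Procrustes fit (rotation, uniform scale, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # optimal rotation from the SVD of the cross-covariance (Kabsch)
    U, s, Vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(Vt.T @ U.T) < 0:   # correct an improper rotation
        Vt[-1] *= -1
        s[-1] *= -1
    R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

Because PA-MPJPE factors out global rotation, scale, and translation, it isolates errors in the pose's internal structure, which is why both numbers are reported.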

Key findings:

  1. Universal degradation: All models suffer a steep increase in MPJPE as occlusion rises. At 30 % occlusion the error roughly doubles compared to the clean case, and beyond 40 % it exceeds 120 mm for several architectures.
  2. Distal joint vulnerability: Wrist, ankle, and toe joints consistently incur the highest errors (15–20 mm more than central joints). This aligns with the intuition that distal joints are more likely to be hidden and have smaller image footprints.
  3. Architectural trends:
    • Convolutional models excel on clean inputs but collapse quickly under occlusion, reflecting their reliance on local temporal continuity.
    • Graph‑based networks benefit from skeletal structure constraints, showing a more gradual error increase, yet they still fail when large portions of the skeleton are missing.
    • Transformer models leverage global attention to compensate for partially missing joints, achieving the best relative robustness among the four families, though they are not immune to severe occlusion.
    • Diffusion models, despite their theoretical ability to generate multiple hypotheses, are the most sensitive to noisy 2D inputs; their denoising steps amplify the injected noise, leading to the lowest overall robustness.
  4. Lack of occlusion awareness: None of the evaluated pipelines incorporates an explicit occlusion‑detection module. They treat all input keypoints as valid, effectively interpreting occluded joints as noisy measurements. This design choice is a primary source of the observed performance drop.

The authors also discuss dataset bias and domain shift. Although BlendMimic3D is synthetic, its rich occlusion annotations and diverse pose configurations make it a valuable proxy for real‑world scenarios. Nevertheless, future work should validate findings on real video streams with authentic occlusion patterns and explore temporal continuity of occlusion (e.g., a limb hidden for several consecutive frames).

Implications and future directions:

  • Integrate a lightweight occlusion predictor that flags unreliable joints before lifting.
  • Design loss functions or attention mechanisms that weight joints according to visibility confidence.
  • Combine diffusion‑based hypothesis generation with visibility‑aware priors to prevent error amplification.
  • Train on multi‑dataset mixtures with aligned skeleton definitions to improve cross‑domain generalisation.
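As one illustration of the second direction, a visibility-weighted per-joint error is straightforward to sketch. The function and its confidence input are hypothetical, not taken from the paper:

```python
import numpy as np

def visibility_weighted_error(pred, gt, confidence, eps=1e-8):
    """Per-joint 3D error weighted by visibility confidence, so joints
    flagged as occluded contribute less to the training signal.

    pred, gt   : (joints, 3) predicted and ground-truth positions
    confidence : (joints,) visibility scores in [0, 1], e.g. from the
                 2D detector or a dedicated occlusion predictor
    """
    per_joint = np.linalg.norm(pred - gt, axis=-1)   # (joints,)
    weights = confidence / (confidence.sum() + eps)  # normalized weights
    return (weights * per_joint).sum()
```

A joint with zero confidence contributes nothing to the loss, which directly addresses the "all keypoints treated as valid" failure mode identified in the findings.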

In conclusion, this benchmark reveals that current state‑of‑the‑art 3D HPE models, regardless of architectural sophistication, are not yet ready for deployment in environments where occlusions are common. The systematic evaluation and per‑joint sensitivity analysis provided by this work offer a clear roadmap for researchers aiming to build truly robust 3D pose estimation systems.

