A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments
Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion approaches. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models, BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion), and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms the single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model shows the lowest performance and is most affected by occlusion, distance, and noise. Our findings highlight the importance of sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.
💡 Research Summary
This paper conducts a systematic comparison of three state‑of‑the‑art 3D person detection approaches—camera‑only BEVDepth, LiDAR‑only PointPillars, and camera‑LiDAR fusion DAL—on the JRDB dataset, which contains diverse indoor and outdoor scenes captured by a mobile robot on a university campus. The authors first adapt each model to the JRDB training/validation/test split, preserving the full 5‑view camera resolution (480 × 752) and the 360° LiDAR point cloud. BEVDepth uses a Lift‑Splat‑Shoot‑style view transformation to produce a bird's‑eye‑view (BEV) representation supervised by LiDAR depth; PointPillars voxelizes the point cloud into vertical pillars and applies 2‑D convolutions; DAL projects both image and LiDAR features into a shared BEV grid and decouples the image features from the regression head to reduce over‑fitting.
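The pillar discretization at the heart of PointPillars can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the grid extent, pillar size, and function name are illustrative assumptions. Each 3D point is assigned to a vertical column on a 2D BEV grid, after which a learned encoder (omitted here) would summarize each pillar into a feature vector:

```python
import numpy as np

def voxelize_to_pillars(points, x_range=(-25.0, 25.0), y_range=(-25.0, 25.0),
                        pillar_size=0.25):
    """Group 3D points (N x 3 array) into vertical pillars on a 2D BEV grid.

    Returns a dict mapping (ix, iy) grid indices to the array of points
    that fall inside each pillar. Grid extent and pillar size here are
    illustrative choices, not the paper's configuration.
    """
    ix = np.floor((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    nx = int((x_range[1] - x_range[0]) / pillar_size)
    ny = int((y_range[1] - y_range[0]) / pillar_size)
    # Drop points outside the BEV grid.
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    pillars = {}
    for i, j, p in zip(ix[valid], iy[valid], points[valid]):
        pillars.setdefault((int(i), int(j)), []).append(p)
    return {k: np.stack(v) for k, v in pillars.items()}
```

Because the z-axis is collapsed into the pillar features rather than voxelized, the subsequent backbone can use cheap 2D convolutions over the BEV grid, which is what makes PointPillars fast.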
Training is performed on two NVIDIA L40 GPUs with batch size 16, using AdamW and a step or cyclic learning‑rate schedule (20 epochs for BEVDepth and PointPillars, 15 epochs for DAL). Evaluation follows the official JRDB protocol: average precision (AP) at IoU thresholds 0.3 and 0.5, considering only 3‑D boxes that contain more than 10 LiDAR points and lie within 25 m of the robot.
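The evaluation filter described above (boxes with more than 10 LiDAR points, within 25 m of the robot) can be sketched as follows. This is a simplified illustration of the protocol, not the official JRDB evaluation code; the function name and array layout are assumptions:

```python
import numpy as np

def filter_eval_boxes(centers, point_counts, max_range=25.0, min_points=10):
    """Keep only 3D boxes that contain more than `min_points` LiDAR points
    and whose center lies within `max_range` metres of the robot, which is
    assumed to sit at the origin of the coordinate frame.

    centers: (N, 3) array of box centers; point_counts: (N,) array.
    Returns a boolean mask over the N boxes.
    """
    centers = np.asarray(centers, dtype=float)
    point_counts = np.asarray(point_counts)
    dist = np.linalg.norm(centers[:, :2], axis=1)  # planar distance to robot
    return (point_counts > min_points) & (dist <= max_range)
```

Filtering out sparse and distant boxes keeps the benchmark focused on detections the LiDAR can plausibly support, so the reported AP numbers are not dominated by unannotatable far-field cases.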
Results show that DAL consistently outperforms the single‑modality baselines. On the test set, DAL achieves AP0.3 = 73.18 % and AP0.5 = 24.73 %, whereas BEVDepth reaches 45.62 % / 12.34 % and PointPillars 60.41 % / 19.87 %. A detailed analysis by distance (0‑5 m, 5‑10 m, 10‑15 m, 15‑25 m) reveals that the fusion model maintains higher recall at longer ranges, especially in the 10‑15 m band where it exceeds the LiDAR‑only model by roughly 10 percentage points. Occlusion analysis (fully visible, mostly visible, severely occluded, fully occluded) demonstrates that BEVDepth’s performance drops sharply as occlusion increases, while DAL degrades more gracefully, retaining useful detections even under severe occlusion.
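The per-range breakdown above relies on assigning each detection to one of the four distance bands. A minimal sketch of that binning, assuming distances are measured from the robot and using NumPy's `digitize`:

```python
import numpy as np

def bin_by_distance(distances, edges=(0.0, 5.0, 10.0, 15.0, 25.0)):
    """Assign each detection to a distance band matching the paper's
    analysis: 0 -> 0-5 m, 1 -> 5-10 m, 2 -> 10-15 m, 3 -> 15-25 m.
    Detections at or beyond 25 m get index -1 (excluded from evaluation).
    """
    d = np.asarray(distances, dtype=float)
    idx = np.digitize(d, edges[1:], right=False)  # upper edges as bin bounds
    idx[d >= edges[-1]] = -1
    return idx
```

Computing AP separately over each band is what exposes the roughly 10-percentage-point gap between DAL and PointPillars in the 10‑15 m range.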
To assess robustness, the authors synthesize twelve corruption types derived from the automotive robustness benchmark (Dong et al., 2023): Gaussian noise on images and LiDAR, cut‑out, crosstalk, density reduction, field‑of‑view loss, spatial and temporal misalignment, fog, and sunlight. Each corruption is applied at three severity levels. The fusion model shows the smallest average performance loss across most corruptions, confirming the advantage of multimodal information. However, DAL is notably vulnerable to LiDAR spatial misalignment (rotation/translation perturbations) and crosstalk noise, where AP can drop by 15‑20 percentage points. BEVDepth is highly sensitive to image‑level noise and fog, with AP reductions exceeding 30 percentage points at the highest severity. PointPillars is most affected by density reduction and FOV loss, reflecting its reliance on sufficient point coverage.
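The spatial-misalignment corruption that DAL is most vulnerable to can be sketched as a small rigid perturbation of the LiDAR point cloud relative to the (unchanged) camera calibration. The rotation and translation magnitudes below are illustrative defaults, not the benchmark's actual severity levels:

```python
import numpy as np

def spatially_misalign(points, yaw_deg=2.0, translation=(0.1, 0.1, 0.0)):
    """Simulate LiDAR-camera spatial misalignment by applying a small
    yaw rotation and translation to the point cloud, in the spirit of
    the corruption benchmark referenced above. `points` is an (N, 3)
    array; the perturbation magnitudes are illustrative assumptions.
    """
    yaw = np.deg2rad(yaw_deg)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    # Rotate about the z-axis, then translate.
    return points @ R.T + np.asarray(translation)
```

Because a fusion model projects LiDAR and image features into one shared BEV grid using fixed extrinsics, even a few degrees of unmodeled rotation shifts the two modalities apart in that grid, which is consistent with the 15‑20 percentage-point AP drops reported for DAL.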
The study concludes that sensor fusion delivers superior accuracy and resilience in complex indoor/outdoor environments, but it also imposes strict requirements on calibration and synchronization. Camera‑only methods remain attractive for low‑cost deployments but struggle with occlusion and distance. LiDAR‑only approaches provide robust depth but suffer when point density is compromised. Future work should explore domain‑adaptation techniques, automated online calibration, and larger benchmarks that incorporate varied weather and lighting conditions to further close the gap between research and real‑world safety‑critical applications.