Improved Single Camera BEV Perception Using Multi-Camera Training


Bird’s Eye View (BEV) map prediction is essential for downstream autonomous driving tasks such as trajectory prediction. Traditionally it has relied on a sophisticated sensor configuration that captures a surround view from multiple cameras. In large-scale production, however, cost efficiency is a key optimization goal, which makes setups with fewer cameras attractive; the drawback is that fewer input images lead to a performance drop. This raises the problem of developing a BEV perception model that performs sufficiently well on a low-cost sensor setup. While this cost restriction applies primarily at inference time on production cars, it is far less of a constraint on a test vehicle during training. The objective of our approach is therefore to reduce the aforementioned performance drop as far as possible by training a modern multi-camera surround-view model and reducing it to single-camera inference. The approach combines three components: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss that supervises the transition from six-camera input to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with six-camera surround view when evaluated with single-camera inference, resulting in reduced hallucination and a better-quality BEV map.


💡 Research Summary

The paper tackles a very practical problem in mass‑produced autonomous vehicles: how to obtain high‑quality bird’s‑eye‑view (BEV) maps when only a single front‑facing camera is available at inference time. State‑of‑the‑art BEV perception models such as BEVFormer achieve excellent results by ingesting images from a full surround‑view camera rig, but the cost of multiple cameras is prohibitive for large‑scale production. The authors therefore propose a training strategy that deliberately creates a mismatch between training (multi‑camera) and inference (single‑camera) conditions and then bridges that gap with three complementary techniques.

  1. Inverse Block Masking – Inspired by recent self‑supervised masking methods, the authors progressively mask out the five non‑front cameras during training. The masking ratio is increased in 20 % steps every few epochs, starting with a modest 20 % mask and ending with 100 % mask, leaving only the front view visible. The masks are rectangular blocks that preserve contiguous visible regions, allowing the network to infer hidden parts from surrounding context. In the final epochs, ground‑truth (GT) bounding boxes that belong to completely masked views are ignored, preventing the model from learning spurious detections outside the visible field.
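The progressive masking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the block size, the four-epoch stage length, and the function names are assumptions chosen for clarity, and the "inverse" aspect is approximated by sampling the visible blocks rather than the masked ones so that kept regions stay contiguous.

```python
import numpy as np

def inverse_block_mask(h, w, mask_ratio, block=4, rng=None):
    """Binary mask where True marks visible pixels.

    Instead of sampling masked patches, we sample the *visible* blocks
    ('inverse' masking), so the kept regions form contiguous rectangles
    from which the network can infer the hidden surroundings.
    Block size 4 is an illustrative choice, not taken from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    gh, gw = h // block, w // block
    n_visible = round((1.0 - mask_ratio) * gh * gw)
    flat = np.zeros(gh * gw, dtype=bool)
    if n_visible > 0:
        flat[rng.choice(gh * gw, size=n_visible, replace=False)] = True
    grid = flat.reshape(gh, gw)
    # Expand each grid cell back to a block x block pixel region.
    return np.kron(grid, np.ones((block, block), dtype=bool)).astype(bool)

def masking_schedule(epoch, epochs_per_stage=4):
    """Mask ratio rises in 20% steps (0.2, 0.4, ..., 1.0) every few epochs."""
    stage = min(epoch // epochs_per_stage, 4)
    return 0.2 * (stage + 1)

def mask_cameras(images, mask_ratio, front_idx=0, block=4, rng=None):
    """Apply inverse block masking to every camera except the front one.

    images: float array of shape (num_cams, H, W, C).
    """
    out = images.copy()
    for cam in range(images.shape[0]):
        if cam == front_idx:
            continue  # the front view always stays fully visible
        m = inverse_block_mask(images.shape[1], images.shape[2],
                               mask_ratio, block, rng)
        out[cam] *= m[..., None]
    return out
```

At a mask ratio of 1.0 the five non-front cameras are zeroed out entirely, which matches the final training stage where only the front view remains visible.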

  2. Cyclic Learning‑Rate Schedule – Because the data distribution changes dramatically as the masking ratio rises, a standard cosine‑annealing schedule is insufficient. The authors adopt a cyclic LR schedule that restarts with a relatively high learning rate at the beginning of each masking stage, then gradually decays within the stage. This gives the optimizer enough “energy” to adapt to the new distribution while still allowing fine‑tuning as the mask becomes more aggressive. The final stage (full mask) uses a very low LR to avoid over‑fitting to the single‑camera scenario.
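One way to realize such a schedule is cosine decay within each masking stage, restarting at a high value whenever the mask ratio steps up. The sketch below is an assumption-laden illustration: the peak and minimum learning rates, the number of stages, and the scale factor for the final stage are all hypothetical values, not figures from the paper.

```python
import math

def cyclic_lr(step, steps_per_stage, n_stages=5,
              peak_lr=2e-4, min_lr=2e-6, final_scale=0.1):
    """Cosine decay within each masking stage, restarting at stage boundaries.

    The restart gives the optimizer enough 'energy' to adapt to the new data
    distribution; the final (fully masked) stage starts from a much lower
    peak (final_scale * peak_lr) to avoid over-fitting to the
    single-camera scenario. All hyperparameters here are illustrative.
    """
    stage = min(step // steps_per_stage, n_stages - 1)
    t = (step - stage * steps_per_stage) / steps_per_stage  # progress in stage
    peak = peak_lr * (final_scale if stage == n_stages - 1 else 1.0)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * min(t, 1.0)))
```

Frameworks such as PyTorch ship a comparable warm-restart scheduler (`CosineAnnealingWarmRestarts`), though aligning the restarts with the masking stages, as done here, requires a custom schedule.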

  3. BEV Feature Reconstruction Loss – Each training sample is processed twice: once with no masking (full six‑camera input) and once with the current mask applied. An L2 loss is computed between the BEV feature embeddings of the two passes, encouraging the masked‑input network to produce features that are close to those obtained from the full‑view input. This loss acts as a form of knowledge distillation from the multi‑camera teacher to the single‑camera student, preserving spatial information that would otherwise be lost.
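The two-pass distillation can be sketched in a few lines. This is a schematic in NumPy rather than a deep-learning framework; the function names and the `lambda_rec` weighting are hypothetical, and in a real implementation the full-view pass would be detached from the gradient graph.

```python
import numpy as np

def bev_reconstruction_loss(bev_masked, bev_full):
    """Mean squared (L2) error between the two BEV feature maps.

    bev_full comes from a pass over the unmasked six-camera input and is
    treated as a constant target (in a real framework: detach /
    stop_gradient), so only the masked pass is optimized toward the
    full-view 'teacher' features.
    """
    diff = np.asarray(bev_masked) - np.asarray(bev_full)
    return float(np.mean(diff ** 2))

def total_loss(task_loss, bev_masked, bev_full, lambda_rec=1.0):
    """Combine the detection/segmentation loss with the reconstruction term.

    lambda_rec is a hypothetical weighting, not a value from the paper.
    """
    return task_loss + lambda_rec * bev_reconstruction_loss(bev_masked, bev_full)
```

Per sample this means two forward passes, one with the current mask and one without, with the loss pulling the masked-input BEV embedding toward the full-view embedding.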

The architecture builds on BEVFormer with a ResNet‑50 backbone and three transformer layers (the “mid‑size” configuration). Experiments are conducted on the nuScenes dataset, using the standard detection metrics (mAP, NDS) and segmentation metric (mIoU). Three training regimes are compared: (i) a baseline trained on a single front camera, (ii) a baseline trained on all six cameras, and (iii) the proposed method that combines masking, cyclic LR, and reconstruction loss.

Results show that the proposed method outperforms both baselines by a large margin. Compared to the single‑camera baseline, it improves NDS by roughly 20 % and mAP by about 25 percentage points (the paper reports this as a 414 % relative gain in mAP, reflecting a drastic reduction in false positives). Semantic segmentation quality, measured by mIoU, rises by 19 %. Qualitative examples illustrate that the model can correctly predict objects and lane markings that lie outside the front‑camera field of view, reducing the hallucinations that are common in single‑camera BEV models.

The study demonstrates that deliberately exposing the network to increasingly limited views during training, coupled with a loss that forces consistency with full‑view features, enables a single‑camera model to inherit much of the spatial awareness of a multi‑camera system. This approach is cost‑effective for production vehicles and can be extended to multimodal settings where lidar or radar are used only during training. Future work may explore finer‑grained masking schedules, alternative backbones, or real‑time inference optimizations, but the current contribution already offers a practical pathway to bring high‑fidelity BEV perception to low‑cost autonomous platforms.

