Revisiting the Evaluation of Deep Neural Networks for Pedestrian Detection
Reliable pedestrian detection is a crucial step towards automated driving systems. However, current performance benchmarks exhibit weaknesses. The metrics currently applied to various subsets of a validation dataset prevent a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point for automatically distinguishing between different types of errors when evaluating a pedestrian detector. In this work, eight error categories for pedestrian detection are proposed, together with new metrics for comparing performance along these categories. We use the new metrics to compare various backbones for a simplified version of the APD, and demonstrate a more fine-grained and robust way to compare models with each other, especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) using a rather simple architecture.
💡 Research Summary
The paper addresses a critical shortcoming in the evaluation of pedestrian detectors for automated driving systems (ADS). Current benchmarks, such as the log‑average miss rate (LAMR) computed on the “reasonable” subset of the CityPersons dataset, rely only on visibility and height thresholds. This coarse filtering can ignore pedestrians that are directly in front of the vehicle—precisely the cases that matter most for safety.
To remedy this, the authors exploit the dense semantic and instance segmentation annotations of the Cityscapes dataset. They define a rule‑based taxonomy that splits ground‑truth pedestrians into eight error categories: foreground (F), background (B), environmental occlusion (E), crowd occlusion (C), ambiguous occlusion (A), scale errors (S), localization errors (L), and ghost detections (H). Occlusion categories are derived from three visibility measures: overall visibility ϕ, environment‑based visibility ϕₑ (using 20 semantic classes that can block a person), and instance‑based visibility ϕ𝚌 (measuring overlap with other pedestrian instances). Thresholds λϕ, λₑ, λ𝚌 are set empirically (mostly zero) to flag occluded persons. The remaining clearly visible persons are split into foreground and background by a height threshold λ_f, which is calibrated from a simplified emergency‑braking distance model (22 m corresponds to ≈190 px in Cityscapes).
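The rule-based taxonomy above can be sketched as a small decision function. This is a hedged illustration, not the paper's reference implementation: the three visibility measures are represented here as occluded fractions (so a threshold of zero flags any occlusion), and the exact precedence of the checks, as well as the fallback to the ambiguous category, are assumptions. Only the default λ_f = 190 px comes from the braking-distance calibration described in the text.

```python
def categorize_gt(occ, occ_e, occ_c, height_px,
                  lam_phi=0.0, lam_e=0.0, lam_c=0.0, lam_f=190):
    """Assign a ground-truth pedestrian to one error category.

    occ    -- overall occluded fraction (1 - visibility phi)
    occ_e  -- fraction occluded by environment classes (1 - phi_e)
    occ_c  -- fraction occluded by other pedestrian instances (1 - phi_c)
    height_px -- bounding-box height in pixels
    Thresholds lam_* correspond to the empirically set lambdas in the text.
    """
    if occ > lam_phi:                        # person is occluded at all
        if occ_e > lam_e and occ_c > lam_c:
            return "A"                       # both environment and crowd: ambiguous
        if occ_e > lam_e:
            return "E"                       # environmental occlusion
        if occ_c > lam_c:
            return "C"                       # crowd occlusion
        return "A"                           # occluded, source unclear (assumed fallback)
    # clearly visible: split foreground/background by the braking-distance height
    return "F" if height_px >= lam_f else "B"
```

A fully visible person taller than 190 px would thus land in the safety-critical foreground category F, while any environment-occluded person lands in E regardless of height.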
False positives are further divided: scale errors (correct center but wrong box size), localization errors (center offset but IoU ≥ 0.25), and ghost detections (random boxes unrelated to any pedestrian). Ghosts are considered the most disruptive for an ADS because they can trigger unnecessary braking or lane changes.
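A minimal sketch of this false-positive subdivision follows. Only the IoU ≥ 0.25 bound for localization errors is stated in the text; the matching against the best-overlapping ground truth and the center tolerance (a quarter of the ground-truth height) are assumptions made here for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def center(b):
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def classify_fp(det, gts, iou_loc=0.25, center_tol=0.25):
    """Assign a false-positive box to S (scale), L (localization) or H (ghost)."""
    best = max(gts, key=lambda g: iou(det, g), default=None)
    if best is None or iou(det, best) < iou_loc:
        return "H"                     # no meaningful overlap: ghost detection
    (dx, dy), (gx, gy) = center(det), center(best)
    h = best[3] - best[1]
    if abs(dx - gx) < center_tol * h and abs(dy - gy) < center_tol * h:
        return "S"                     # center roughly correct, box size wrong
    return "L"                         # overlapping but center offset
```

Under this scheme a box concentric with a pedestrian but half its size counts as a scale error, while a box floating in empty road counts as a ghost, matching the intuition that ghosts are the most disruptive for an ADS.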
Based on this taxonomy the authors introduce a filtered LAMR, denoted LAMR_P, which computes miss rates separately for foreground and background and reports occlusion‑specific false‑negative counts. This metric provides a safety‑oriented view of detector performance, complementing the traditional LAMR_r that treats all errors equally.
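For reference, the underlying log-average miss rate can be sketched as below. This follows the common Caltech-style protocol (geometric mean of the miss rate sampled at nine log-spaced FPPI reference points between 10⁻² and 10⁰); the paper's LAMR_P would apply the same computation after restricting ground truths and detections to a category such as foreground. The sampling rule and the fallback for uncovered reference points are assumptions of this sketch.

```python
import numpy as np

def lamr(miss_rates, fppi, refs=np.logspace(-2, 0, 9)):
    """Log-average miss rate over an MR/FPPI curve.

    miss_rates, fppi -- parallel arrays describing the curve,
    with fppi sorted in ascending order.
    """
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # take the miss rate at the largest FPPI not exceeding the reference;
        # if the curve never reaches this FPPI, fall back to its worst value
        sampled.append(miss_rates[idx[-1]] if len(idx) else max(miss_rates))
    sampled = np.clip(sampled, 1e-10, None)   # guard the log against zeros
    return float(np.exp(np.mean(np.log(sampled))))
```

Because the average is taken in log space, a detector must be good across the whole FPPI range, not just at one operating point.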
For empirical validation, a Generic Pedestrian Detector (GPD) is built. Four popular backbones—CSP‑ResNet‑50, FPN‑ResNet‑50, MDLA‑UP‑34, and BGC‑HRNet‑w32—are equipped with three perception heads (center, scale, offset) using 3×3 convolutions. Training uses Adam (learning rate 1e‑4 after warm‑up), 50 k iterations, image size 640×1028, confidence threshold 0.01, and NMS at 0.5. No extra data beyond CityPersons is used.
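How the three perception heads turn into boxes can be sketched as a CSP-style decoding step. This is a hedged reconstruction: the stride of 4, the log-height parameterization of the scale map, and the fixed 0.41 width-to-height ratio are assumptions (the latter being the common CityPersons box convention); only the 0.01 confidence threshold is stated in the text, and NMS at 0.5 would be applied to the result.

```python
import numpy as np

def decode_detections(center_map, scale_map, offset_map,
                      stride=4, score_thresh=0.01, aspect=0.41):
    """Decode head outputs into boxes (x1, y1, x2, y2, score).

    center_map -- (H, W) per-cell pedestrian-center scores after sigmoid
    scale_map  -- (H, W) predicted log-heights
    offset_map -- (2, H, W) sub-cell (x, y) center offsets
    """
    ys, xs = np.where(center_map >= score_thresh)
    boxes = []
    for y, x in zip(ys, xs):
        h = np.exp(scale_map[y, x]) * stride          # height in image pixels
        w = aspect * h                                # fixed-aspect width
        cx = (x + offset_map[0, y, x]) * stride       # refined center, x
        cy = (y + offset_map[1, y, x]) * stride       # refined center, y
        boxes.append((cx - w / 2, cy - h / 2,
                      cx + w / 2, cy + h / 2,
                      float(center_map[y, x])))
    return boxes
```

The 3×3-convolution heads themselves differ only per backbone; this decoding stage is what all four variants share.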
Results show that all four backbones achieve state‑of‑the‑art LAMR_r (the best being 8.6 % for CSP‑ResNet‑50) while dramatically reducing safety‑critical miss rates: foreground false‑negative rates drop by more than 30 % compared with prior work, and ghost false‑positive rates are substantially lower. The analysis demonstrates that the proposed error categories expose nuanced failure modes that are invisible to conventional metrics.
The paper’s contributions are threefold: (1) a systematic, segmentation‑driven error taxonomy that distinguishes between different occlusion and false‑positive types; (2) a filtered LAMR metric that aligns evaluation with ADS safety requirements; (3) a simple yet effective detector architecture that reaches SOTA performance without additional training data.
Limitations include the reliance on empirically set thresholds, which may need retuning for other cities, weather conditions, or sensor setups; the lack of a community‑wide standard for the filtered LAMR, making cross‑paper comparisons difficult; and the GPD’s design for relatively low‑resolution inputs, which could limit real‑time deployment on high‑resolution streams.
Overall, the work shifts pedestrian‑detector evaluation from a generic accuracy focus toward a safety‑centric perspective, providing tools that could become standard in ADS validation pipelines once broader adoption and further generalization studies are performed.