Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles
Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task over road scenes captured by a vehicle’s front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models (VLMs; Qwen2.5-VL-7B versus Qwen2.5-VL-32B). The results demonstrate that the direct object detector (YOLOv13) and Qwen2.5-VL-32B significantly outperform the other approaches, achieving macro F1-scores over 0.84. The large VLM (Qwen2.5-VL-32B), in particular, exhibited superior robustness and performance when identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a “part-versus-whole” semantic gap that leads to substantial recall failures. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.
💡 Research Summary
This paper tackles the problem of identifying the semantic object that a driver is looking at, using the vehicle’s front‑view camera image together with a time‑synchronized gaze coordinate. The authors frame the task as a point‑of‑gaze object identification problem and evaluate three fundamentally different vision paradigms: (1) a traditional object‑detection pipeline based on YOLOv13, (2) a segmentation‑assisted two‑stage pipeline that first generates a mask with SAM2 and then classifies the masked crop with EfficientNetV2, and (3) query‑based Vision‑Language Models (VLMs) using the Qwen2.5‑VL series (7‑billion and 32‑billion parameter versions).
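For the detection-based paradigm, the core matching step reduces to a point-in-box test: find which detected bounding box contains the gaze coordinate. A minimal sketch of that step follows; the function name, data layout, and the smallest-box tie-breaking rule are assumptions for illustration, not the paper’s implementation:

```python
def identify_gazed_object(gaze, detections):
    """Return the class label of the detection whose box contains the gaze point.

    gaze: (x, y) pixel coordinate of the driver's gaze in the camera image.
    detections: list of dicts with 'box' = (x1, y1, x2, y2), 'label', 'conf'.
    When several boxes contain the point (e.g. a traffic light in front of a
    building), the smallest box is preferred, assuming the driver fixates
    the most specific object. Returns None if no box contains the point.
    """
    x, y = gaze
    hits = [
        d for d in detections
        if d["box"][0] <= x <= d["box"][2] and d["box"][1] <= y <= d["box"][3]
    ]
    if not hits:
        return None

    def area(d):
        x1, y1, x2, y2 = d["box"]
        return (x2 - x1) * (y2 - y1)

    return min(hits, key=area)["label"]
```

With this shape of output from any detector, resolving a gaze sample is a single call, e.g. `identify_gazed_object((95, 40), detections)`.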
To enable a fair comparison, a new benchmark was built from the BDD100K dataset. The authors selected city‑street images and manually annotated gaze points on five safety‑critical classes (person, car, bus, truck, traffic light) under four challenging environmental conditions: clear daytime, clear night, rainy daytime, and rainy night. Each sample consists of an image, a gaze coordinate, and the ground‑truth class of the object at that coordinate.
Experiments measured macro F1‑score, per‑class precision/recall, and inference latency. YOLOv13 achieved a macro F1 of 0.86 with ~12 ms latency, demonstrating strong real‑time performance but struggling with very small or partially occluded objects. The SAM2 + EfficientNetV2 pipeline suffered from a “part‑versus‑whole” semantic gap: the mask often captured only a sub‑part (e.g., a wheel) while the driver’s intent was the whole vehicle, leading to a macro F1 of only 0.55 and low recall (0.42).
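The part-versus-whole gap can be made concrete as a coverage check: how much of the intended whole object does the returned mask actually cover? The following diagnostic is a hypothetical sketch (not a metric from the paper), using bounding boxes as a stand-in for the mask and the ground-truth object:

```python
def coverage_ratio(mask_box, object_box):
    """Fraction of the whole object's area covered by the mask's bounding box.

    Boxes are (x1, y1, x2, y2) in pixels. A low ratio signals a
    'part-versus-whole' failure: the segmenter returned a sub-part
    (e.g. a wheel) while the driver's intent was the whole vehicle.
    """
    mx1, my1, mx2, my2 = mask_box
    ox1, oy1, ox2, oy2 = object_box
    # Intersection rectangle between mask box and object box.
    ix1, iy1 = max(mx1, ox1), max(my1, oy1)
    ix2, iy2 = min(mx2, ox2), min(my2, oy2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    obj_area = (ox2 - ox1) * (oy2 - oy1)
    return inter / obj_area if obj_area else 0.0
```

For example, a wheel-sized mask on a full car box yields a ratio near 0.04, flagging exactly the failure mode described above.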
The VLM approaches showed the most interesting trade‑offs. The 7‑billion‑parameter Qwen2.5‑VL‑7B reached a macro F1 of 0.79, while the 32‑billion‑parameter Qwen2.5‑VL‑32B improved to 0.88, albeit with higher latency (~85 ms on a high‑end GPU). Notably, the large VLM excelled at identifying small, safety‑critical objects such as traffic lights under low‑light and rainy conditions, achieving a recall of 0.81 where YOLOv13 dropped below 0.60. This robustness stems from the VLM’s ability to perform visual grounding and leverage rich language‑based context, effectively reasoning about the scene beyond the raw visual features.
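The query-based paradigm hinges on how the gaze coordinate is phrased to the model. A minimal sketch of such a query builder is shown below; the prompt wording and the constrained label set are assumptions for illustration, since the paper’s exact prompt is not reproduced here:

```python
CLASSES = ["person", "car", "bus", "truck", "traffic light"]

def build_gaze_query(x, y, classes=CLASSES):
    """Construct a text query asking a VLM which object lies at pixel (x, y).

    Constraining the answer to the benchmark's five classes keeps the
    free-form VLM output easy to score against the ground-truth label.
    """
    return (
        f"In the attached front-view driving image, what object is located "
        f"at pixel coordinate ({x}, {y})? Answer with exactly one of: "
        + ", ".join(classes) + "."
    )
```

The resulting string would be sent alongside the camera frame to the VLM; constraining the output space is one simple way to turn open-ended VLM answers into classification predictions comparable with the detector baselines.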
The authors discuss the fundamental trade‑off revealed by the study: traditional detectors offer deterministic, low‑latency inference suitable for real‑time ADAS, whereas large VLMs provide richer contextual understanding and superior performance on challenging objects but at a computational cost that may exceed current in‑vehicle hardware limits. The segmentation‑assisted approach, despite its theoretical pixel‑level precision, is hampered by the need for a separate classifier and the semantic mismatch between mask and driver intent.
Key contributions include (1) the design of a systematic evaluation framework for three distinct paradigms, (2) the first in‑depth assessment of large VLMs for gaze‑based object identification, (3) the release of a manually annotated, multi‑condition benchmark derived from BDD100K, and (4) a detailed analysis of performance versus efficiency across environmental scenarios and object categories.
The paper concludes that large VLMs, especially the 32‑billion‑parameter model, represent a promising direction for human‑centric driver monitoring systems, particularly for safety‑critical, low‑visibility scenarios. Future work is suggested on model compression, real‑time VLM inference, multimodal fusion of gaze heatmaps with detection outputs, and extending the framework to temporal gaze sequences for richer cognitive state estimation.