Can Vision-Language Models Answer Face to Face Questions in the Real-World?

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources of the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce the gap.


💡 Research Summary

The paper tackles the long‑standing AI goal of real‑time, face‑to‑face visual question answering (VQA) by introducing a novel benchmark, the Qualcomm Interactive Video Dataset (QIVD), and by evaluating existing large multimodal models (LMMs) on this task. QIVD consists of 2,900 short videos (average length 5.1 s, 30 fps, 640×382 resolution) recorded by crowd workers using a camera and microphone on a mobile device. While recording, participants ask an open‑ended question in natural speech that refers to the ongoing visual scene (e.g., pointing gestures, actions, objects). After collection, each video is annotated with (1) a human‑generated transcript of the spoken question, (2) a concise textual answer, and crucially (3) a timestamp indicating the exact moment when sufficient visual‑auditory evidence is available to answer the question correctly. This “answer‑when” annotation forces models to decide not only what to answer but also when to answer, a capability largely absent from existing VQA benchmarks.
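To make the "answer-when" annotation concrete, the sketch below models one QIVD-style example as a record with a question transcript, a ground-truth answer, and the evidence timestamp. The field names (`video_path`, `answer_time_s`) and the `evidence_frame` helper are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one QIVD-style example; field names
# are illustrative, not the dataset's published schema.
@dataclass
class IVDExample:
    video_path: str       # short clip, ~5 s at 30 fps
    question: str         # human transcript of the spoken question
    answer: str           # concise ground-truth answer
    answer_time_s: float  # earliest moment with enough evidence to answer

def evidence_frame(example: IVDExample, fps: int = 30) -> int:
    """Index of the first frame at which the answer timestamp has passed."""
    return int(example.answer_time_s * fps)

ex = IVDExample("clip_0001.mp4", "How many mugs am I holding?", "Two", 2.4)
print(evidence_frame(ex))  # 72 (i.e., 2.4 s into a 30 fps stream)
```

A model evaluated against such a record is judged not only on matching `answer`, but on withholding its reply until roughly `evidence_frame` has been reached.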

The dataset is further enriched with a taxonomy of 13 semantic categories (action detection, counting, object attributes, deictic reference, audio‑visual reasoning, OCR, subjective judgment, etc.) and five question types (what, how, where, deictic expressions, etc.). Statistics show a vocabulary of 3,624 unique words, average question length of 6.09 words, and average answer length of 7.23 words. The distribution of categories is balanced enough to probe a wide range of situated reasoning skills.

The authors first evaluate several state‑of‑the‑art LMMs—including GPT‑4o, LLaVA‑1.5, Video‑LLM‑online, FlashVStream, and other open‑source models—by feeding them the raw video frames and the transcribed question without any task‑specific adaptation. Performance is uniformly poor: overall accuracy hovers below 30%, far from human performance (~92%). Detailed error analysis reveals three dominant failure modes: (1) Real‑time multimodal fusion – models struggle to combine visual and auditory streams on the fly, leading to misinterpretation of deictic references or of actions that unfold after the question is spoken; (2) Temporal decision‑making – models typically answer immediately after the question ends or at the video’s termination, ignoring the provided answer‑timestamp cue; (3) Situational common sense – questions that require everyday knowledge (e.g., “Is this pan being used correctly?”) are often answered incorrectly or left unanswered.
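The second failure mode above can be made concrete with a minimal timing check. The predicate and the 0.5 s tolerance below are illustrative assumptions, not the paper's scoring rule: a reply is treated as well-timed only if it arrives at or after the annotated evidence moment.

```python
def timing_correct(reply_time_s: float, evidence_time_s: float,
                   tolerance_s: float = 0.5) -> bool:
    """A reply counts as well-timed if it comes at or after the evidence
    moment, within a tolerance window (the threshold is illustrative)."""
    return reply_time_s >= evidence_time_s - tolerance_s

# A model that always answers the instant the question ends (the failure
# mode reported for current LMMs) replies too early whenever the
# asked-about action only unfolds after the question is spoken.
question_end_s = 1.8   # the spoken question finishes here
evidence_s = 3.2       # e.g., the relevant action completes here
print(timing_correct(question_end_s, evidence_s))  # False: answered too early
```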

To assess whether the gap can be narrowed, the authors fine‑tune the same models on the full QIVD training split (≈2,600 videos) for five epochs, using a streaming‑compatible architecture that processes frames sequentially and incorporates the answer‑timestamp as a supervisory signal for “when to speak.” After fine‑tuning, average accuracy rises to ~55%, with notable gains in categories that rely heavily on temporal grounding: action counting (+20 pp), object referencing (+18 pp), and deictic expressions (+15 pp). The improvement is attributed to (a) exposure to real‑time streaming inputs, (b) explicit learning of answer timing, and (c) direct supervision on a diverse set of everyday visual situations.
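One plausible way to turn the answer-timestamp into a "when to speak" training signal is a per-frame binary target (silent before the evidence frame, allowed to speak after) trained with cross-entropy. The sketch below is a minimal illustration of that idea in plain Python; the paper's actual supervision scheme may differ.

```python
import math

def speak_targets(num_frames: int, answer_frame: int) -> list[int]:
    """Per-frame binary supervision: 0 = stay silent, 1 = may answer.
    A streaming head trained on these targets learns *when* to speak."""
    return [0 if t < answer_frame else 1 for t in range(num_frames)]

def bce_loss(probs: list[float], targets: list[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over the frame sequence (plain Python)."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, targets)
    ) / len(targets)

targets = speak_targets(6, 4)  # silent for frames 0-3, speak from frame 4
print(targets)                 # [0, 0, 0, 0, 1, 1]
```

Under this formulation, a model that fires the moment the question ends incurs a large loss on the frames between question end and the annotated evidence moment, which is exactly the behavior the fine-tuning is meant to correct.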

Nevertheless, even the fine‑tuned models remain well below human levels, especially on tasks that demand audio‑visual correlation (e.g., “Which object is making the sound?”) and subjective judgments (e.g., “Does this dish look tasty?”). Moreover, the current experiments are conducted with offline batch fine‑tuning; true low‑latency inference (sub‑100 ms per frame) and continuous learning in an online setting have not been demonstrated.

The paper’s contributions are fourfold: (1) the creation of QIVD, the first benchmark that couples real‑time video, speech, and answer‑timing annotations for face‑to‑face VQA; (2) a systematic evaluation of existing LMMs that uncovers critical weaknesses in real‑time multimodal reasoning; (3) evidence that task‑specific fine‑tuning on QIVD can substantially close the performance gap; (4) a simple baseline streaming pipeline that departs from the traditional offline paradigm.

In discussion, the authors outline future research directions: developing low‑latency streaming architectures (e.g., frame‑wise transformer or recurrent multimodal encoders), improving multimodal fusion mechanisms that jointly attend to audio and visual streams, integrating external knowledge bases for common‑sense reasoning, and exploring continual learning setups where the model can adapt on‑the‑fly during live interaction. By providing QIVD and the associated analysis, the work establishes a concrete platform for advancing AI assistants, humanoid robots, and video‑call chatbots toward truly situated, real‑time conversational capabilities.

