Improving Visual Object Tracking through Visual Prompting
Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Dynamically adapting the target representation against distractors remains challenging because prevailing trackers exhibit limited discriminative capability. To address this issue, we present PiVOT, a new visual prompting mechanism for generic object tracking. PiVOT leverages a pretrained foundation model (CLIP) to automatically generate and refine visual prompts online, transferring the foundation model's contrastive knowledge to the tracker and enabling it to suppress distractors through contrastive guidance. Specifically, PiVOT employs a prompt initialization mechanism that produces an initial visual prompt highlighting potential target locations. The foundation model then refines the prompt based on appearance similarities between candidate objects and reference templates. After refinement, the visual prompt better highlights potential target locations and carries less irrelevant information. Guided by these visual prompts, which are incrementally and automatically updated during tracking, the tracker generates instance-aware feature maps that effectively suppress distractors. Extensive experiments across multiple benchmarks indicate that PiVOT, with the proposed prompting mechanism, suppresses distracting objects and improves tracking performance.
💡 Research Summary
The paper introduces PiVOT, a novel visual‑prompting framework for generic object tracking (GOT) that leverages large‑scale vision foundation models—CLIP for contrastive similarity and DINOv2‑based ViT‑L for dense feature extraction. The central problem addressed is the limited discriminative capability of existing trackers when faced with distractors, illumination changes, occlusions, or unseen object categories. PiVOT tackles this by automatically generating and refining a visual prompt that highlights candidate target locations and suppresses irrelevant regions during inference.
Architecture Overview
- Backbone & Feature Extraction – A frozen ViT‑L backbone pretrained with DINOv2 extracts high‑dimensional features from the current frame and two reference frames (the initial template and a template updated from the previous frame).
- Prompt Generation Network (PGN) – A lightweight network, structurally similar to the tracking head, computes a correlation‑based score map between current‑frame features and reference templates. This score map serves as an initial visual prompt that marks potential target positions.
- Test‑time Prompt Refinement (TPR) – Inserted only during inference, TPR crops multiple Regions‑of‑Interest (RoIs) from the current frame based on the PGN score map. Each RoI is fed to CLIP’s image encoder; cosine similarity between RoI embeddings and reference‑template embeddings is calculated. RoIs with higher similarity receive boosted scores on the prompt, while low‑similarity regions are attenuated. This step exploits CLIP’s zero‑shot, class‑agnostic contrastive knowledge to refine the prompt without any human‑provided annotations.
- Relation Modeling (RM) Module – The refined visual prompt is treated as a spatial mask that modulates the current‑frame feature map. RM multiplies (or otherwise fuses) the prompt with the feature map, enhancing activations at prompt‑highlighted locations and suppressing activations elsewhere. The resulting prompt‑guided features are fed to the final tracking head.
- Tracking Head – A transformer‑based model predictor (as in ToMP) predicts convolutional filter weights and generates the final response map, yielding the target’s bounding box.
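The three prompting steps above (correlation-based initialization, similarity-based refinement, and feature modulation) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: plain cosine correlation stands in for the learned PGN, precomputed embeddings stand in for CLIP's image encoder, and the `boost`/`damp` thresholding is an assumed simplification of the reweighting logic.

```python
import numpy as np

def correlation_prompt(frame_feats, template_feat):
    """Initial visual prompt: cosine similarity between each spatial
    location of the frame feature map (H, W, C) and a pooled template
    embedding (C,). A stand-in for the learned PGN score map."""
    t = template_feat / (np.linalg.norm(template_feat) + 1e-8)
    f = frame_feats / (np.linalg.norm(frame_feats, axis=-1, keepdims=True) + 1e-8)
    return f @ t  # (H, W) score map in [-1, 1]

def refine_prompt(prompt, roi_embeds, ref_embed, roi_coords,
                  boost=2.0, damp=0.5, thresh=0.5):
    """Test-time refinement: RoIs whose embedding is similar to the
    reference-template embedding get boosted on the prompt, dissimilar
    ones attenuated. `roi_embeds` stands in for CLIP image-encoder
    outputs; the threshold-and-scale rule is an assumption."""
    ref = ref_embed / (np.linalg.norm(ref_embed) + 1e-8)
    refined = prompt.copy()
    for emb, (y0, y1, x0, x1) in zip(roi_embeds, roi_coords):
        e = emb / (np.linalg.norm(emb) + 1e-8)
        sim = float(e @ ref)  # cosine similarity in [-1, 1]
        refined[y0:y1, x0:x1] *= boost if sim > thresh else damp
    return refined

def modulate(frame_feats, prompt):
    """Relation-modeling step reduced to a simple spatial reweighting:
    scale features by the min-max-normalized prompt, enhancing
    prompt-highlighted locations and suppressing the rest."""
    p = (prompt - prompt.min()) / (np.ptp(prompt) + 1e-8)
    return frame_feats * p[..., None]
```

In the actual architecture the fusion is learned rather than a fixed multiplication, but the sketch captures the data flow: score map in, reweighted score map out, modulated features into the tracking head.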
Training Strategy
During training, PGN and RM are learned jointly with the tracking head while the backbone remains frozen. Only a lightweight adapter (≈1 M trainable parameters) is optimized, dramatically reducing GPU memory and training time compared to full‑backbone fine‑tuning. The prompt refinement step is omitted during training; the network learns to produce a useful initial prompt that can later be sharpened by CLIP at test time.
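The adapter-only optimization described above can be illustrated with a toy parameter-update loop. Everything here is hypothetical (parameter names, shapes, and the plain SGD rule are invented for illustration); the point is simply that gradient updates touch only the small adapter group while the backbone weights never change.

```python
import numpy as np

# Hypothetical parameter groups: a large frozen backbone
# and a small trainable adapter (names are illustrative).
params = {
    "backbone.proj": np.ones((4, 4)),
    "adapter.down":  np.ones((4, 2)),
    "adapter.up":    np.ones((2, 4)),
}
frozen = {name for name in params if name.startswith("backbone")}

def sgd_step(params, grads, lr=0.1):
    """Apply SGD only to non-frozen parameters, mirroring the strategy
    of optimizing a lightweight adapter while the ViT-L backbone stays
    fixed (which also cuts optimizer state and GPU memory)."""
    for name, g in grads.items():
        if name in frozen:
            continue  # frozen backbone weights are never updated
        params[name] -= lr * g

# Only the adapter contributes trainable parameters.
n_trainable = sum(p.size for n, p in params.items() if n not in frozen)
```

In a real PyTorch setup the same effect comes from setting `requires_grad = False` on backbone parameters and passing only the adapter parameters to the optimizer.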
Key Contributions & Findings
- Automatic Visual Prompt Generation: No manual prompt annotations are required; the system derives prompts from correlation scores and CLIP‑based similarity.
- Contrastive Knowledge Transfer: By feeding CLIP‑refined prompts into RM, the method transfers CLIP’s category‑level contrastive knowledge to the instance‑level tracking task, improving discrimination between the target and distractors, even for unseen categories.
- Efficiency: Freezing the large ViT‑L backbone and using a tiny adapter keeps training cost low while still benefiting from the dense, generalized representations of a foundation model.
- Performance Gains: Extensive experiments on LaSOT, GOT‑10k, TrackingNet, TNL2K, and other benchmarks show consistent improvements of 2–4 points in AUC over the baseline ToMP tracker. Gains are especially pronounced under heavy occlusion, rapid appearance change, and background clutter.
Limitations & Future Work
The TPR stage introduces additional CLIP inference, which can affect real‑time speed. The quality of the refined prompt depends on the quality of the initial RoI proposals; failure to generate good candidates may limit the benefit. Future directions include (i) designing a lightweight contrastive module to replace CLIP for faster inference, (ii) exploring multimodal prompts (e.g., text + image) to further enrich the discriminative signal, and (iii) integrating adaptive RoI proposal mechanisms to make the system more robust to extreme motion or scale changes.
Conclusion
PiVOT demonstrates that a visual‑prompting paradigm—where a prompt is generated, refined by a foundation model, and used to modulate feature representations—can substantially boost the discriminative power of generic object trackers. By bridging the gap between class‑level contrastive knowledge (CLIP) and instance‑level tracking, PiVOT offers a practical and scalable route to more robust, adaptable visual tracking without the heavy cost of full backbone fine‑tuning.