RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling and, lacking language guidance, fail to adapt to appearance variations. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. We then propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Next, we design an Adaptive Token Fusion (ATF) module to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancy and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) that maintains a dynamic knowledge base and employs Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
💡 Research Summary
RAGTrack tackles the long‑standing challenges of RGB‑Thermal (RGBT) object tracking by introducing natural language as a high‑level semantic cue and by leveraging Retrieval‑Augmented Generation (RAG) for temporal reasoning. The authors first address the lack of textual annotations in existing RGBT benchmarks. Using a pipeline built on Multi‑modal Large Language Models (MLLMs), they automatically generate concise natural‑language descriptions for each frame, covering object category, appearance attributes, and motion states. This creates a unified visual‑language dataset without manual labeling effort.
The core of the proposed system consists of three tightly coupled modules.
- Multi‑modal Transformer Encoder (MTE) – Both RGB and thermal (TIR) images are down‑sampled, patch‑embedded, and tokenized. A fixed textual prompt (“A sequence of a
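To make the MTE's tokenization step concrete, here is a minimal NumPy sketch (not the authors' implementation) of how RGB and thermal frames can be patch-embedded into a shared token sequence and concatenated with text-prompt tokens. The input sizes, 16-pixel patches, 64-d token dimension, and the random projection standing in for a learned linear layer are all illustrative assumptions.

```python
import numpy as np

def patch_embed(image, patch=16, dim=64, seed=0):
    """Split an HxWxC image into non-overlapping patches and project each
    flattened patch to a `dim`-d token. A fixed random projection stands
    in for the learned linear embedding used in a real encoder."""
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    proj = rng.standard_normal((patch * patch * C, dim)) / np.sqrt(patch * patch * C)
    return patches @ proj  # (num_patches, dim)

# Hypothetical inputs: a 128x128 RGB frame and a single-channel thermal frame.
rgb = np.zeros((128, 128, 3), dtype=np.float32)
tir = np.zeros((128, 128, 1), dtype=np.float32)

rgb_tokens = patch_embed(rgb)            # (64, 64): 8x8 grid of patch tokens
tir_tokens = patch_embed(tir)            # (64, 64)
text_tokens = np.zeros((8, 64))          # placeholder for prompt-text tokens

# The unified sequence a transformer encoder would attend over.
tokens = np.concatenate([rgb_tokens, tir_tokens, text_tokens], axis=0)
print(tokens.shape)  # (136, 64)
```

In practice the projection is learned, positional embeddings are added, and the text tokens come from a language encoder, but the joint RGB + TIR + text token sequence is the key structural idea.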