SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking
Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is especially challenging in dynamic scenarios. Numerous methods have therefore been proposed to introduce temporal cues that enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies, often requiring either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), which provides a clean training and tracking pipeline that builds long-range temporal dependencies without customized modules or substantial computational costs. SMTrack offers several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack enables long-range temporal interactions with linear computational complexity during training. Third, SMTrack lets each frame interact with previously tracked frames via hidden-state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.
💡 Research Summary
The paper introduces SMTrack, a novel visual tracking framework that leverages a state‑aware version of the Mamba state‑space model (SSM) to achieve efficient long‑range temporal modeling. Traditional CNN‑based trackers rely on online filter optimization, which introduces complex pipelines and high computational overhead, while Transformer‑based trackers concatenate dynamic templates into the input, incurring quadratic O(L²) attention costs that limit the number of templates that can be used. Both approaches struggle to capture long‑range dependencies without excessive computation or custom modules.
Mamba, a recent SSM originally proposed for natural language processing, addresses these issues by using an input‑driven selective scan that enables each token to interact with all previous tokens with linear complexity in sequence length. However, the original Mamba shares a single timescale parameter Δ across all hidden states, which restricts the model’s ability to differentiate among diverse temporal cues such as target appearance changes, background variations, and distractors—critical factors for robust visual tracking.
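The shared-timescale recurrence described above can be sketched as a minimal single-channel selective scan. This is an illustrative reconstruction of Mamba's discretized recurrence (shapes and names are ours, not the paper's); the key point is that one scalar Δ per step drives all N hidden states of the channel:

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Minimal single-channel selective scan (Mamba-style recurrence).

    x:     (T,)   input token sequence for one channel
    A:     (N,)   diagonal state matrix
    B, C:  (T, N) input-dependent projection parameters
    delta: (T,)   input-dependent timescale -- note: ONE scalar per step,
                  shared by all N hidden states (the limitation SASM targets)
    """
    T, N = B.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        # zero-order-hold discretization with a single shared timescale
        a_bar = np.exp(delta[t] * A)   # (N,) per-state decay, same delta
        b_bar = delta[t] * B[t]        # (N,) input gate, same delta
        h = a_bar * h + b_bar * x[t]   # causal state update: h_t depends on h_{t-1}
        y[t] = C[t] @ h                # readout from the accumulated state
    return y
```

Because each step only folds the current token into `h`, every token interacts with all previous tokens at O(T) total cost, in contrast to the O(T²) cost of attention.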
To overcome this limitation, the authors propose the Selective State‑Aware Space Model (SASM). SASM assigns a distinct timescale Δ to each hidden‑state dimension, allowing different states to evolve on different temporal scales. Consequently, some states can focus on rapidly changing target features, while others retain slowly varying background context. Moreover, SASM introduces explicit interactions among hidden states after each image token, building dense dependencies that further enrich the temporal representation.
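A minimal sketch of the state-wise idea follows, assuming a per-state Δ of shape (T, N) and a simple mixing matrix `W` standing in for the hidden-state interaction; both are illustrative placeholders, not the paper's exact formulation:

```python
import numpy as np

def state_aware_scan(x, A, B, C, delta, W):
    """Sketch of a state-wise-timescale scan (the idea behind SASM).

    delta: (T, N) -- a DISTINCT timescale per hidden-state dimension,
           so state n can evolve fast (large delta) or slow (small delta)
    W:     (N, N) -- stand-in for the hidden-state interaction applied
           after each token, coupling states into dense dependencies
    """
    T, N = B.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        a_bar = np.exp(delta[t] * A)   # (N,) per-state decay rates now differ
        b_bar = delta[t] * B[t]        # (N,) per-state input gates
        h = a_bar * h + b_bar * x[t]
        h = W @ h                      # explicit interaction among hidden states
        y[t] = C[t] @ h
    return y
```

With distinct rows of `delta`, some states can decay quickly (tracking rapid target appearance changes) while others decay slowly (retaining background context), which a single shared Δ cannot express.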
The training and inference pipeline is built around a “Temporal Causal Scanning” strategy. During training, past target templates are scanned once in chronological order; the hidden state is updated at each step, and the current search region directly interacts with the accumulated hidden state to produce predictions. Because the templates are scanned only once, the computational cost grows linearly with the number of frames (O(T)). At inference time, the tracker does not re‑scan templates for each new frame; instead, it propagates the hidden state forward, allowing the current search region to access all previously observed information without redundant computation. Within each frame, a bidirectional scan is still employed to capture spatial context, but cross‑frame interactions remain causal.
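The inference-time propagation can be illustrated with a toy running-state object (a hypothetical interface, not the paper's code): each newly tracked frame updates the state once, so the per-frame cost is constant in the number of past frames rather than requiring a re-scan of all stored templates:

```python
import numpy as np

class TemporalState:
    """Toy hidden-state propagation for causal cross-frame interaction.

    Instead of re-scanning all past templates for every new frame, we
    keep one running hidden state and fold each tracked frame in once.
    """

    def __init__(self, n_states):
        self.h = np.zeros(n_states)

    def update(self, template_feat, A, B, delta):
        """Fold one new template frame into the running state (O(1) per frame)."""
        a_bar = np.exp(delta * A)              # per-state decay
        self.h = a_bar * self.h + delta * B * template_feat

    def read(self, C):
        """Let the current search region read the accumulated history."""
        return C @ self.h
```

The running state plays the role of a compressed memory of all previously tracked frames: `read` gives the current search region access to that history without revisiting any earlier frame.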
The overall architecture consists of: (1) a CNN backbone that extracts features from both the target template(s) and the search region; (2) a stack of SASM blocks that integrate temporal cues with linear complexity; and (3) a decoder head that predicts bounding‑box coordinates and confidence scores. The SASM blocks extend the original Mamba by incorporating state‑wise Δ and hidden‑state interaction modules, adding negligible parameter overhead while substantially improving expressive power.
Extensive experiments on eight benchmark datasets—including OTB‑2015, UAV123, LaSOT, TrackingNet, and others—demonstrate that SMTrack outperforms prior SSM‑based methods (e.g., MambaVT) and state‑of‑the‑art CNN/Transformer trackers in terms of success rate (AUC) and precision (F‑score). Notably, SMTrack achieves these gains with significantly lower FLOPs and higher real‑time throughput (exceeding 30 FPS on a single GPU), confirming its suitability for resource‑constrained platforms. Ablation studies reveal that both the state‑wise timescale and the hidden‑state interaction contribute positively to performance, and that increasing the number of stored templates does not break the linear complexity guarantee.
In summary, SMTrack delivers three key innovations: (1) a selective state‑aware space model that captures diverse temporal cues via state‑wise timescales; (2) a linear‑complexity temporal causal scanning mechanism that provides a clean, end‑to‑end training and inference pipeline without custom modules; and (3) an efficient hidden‑state propagation scheme that enables each frame to access all past information without repeated template scanning. These contributions collectively advance the state of visual tracking, offering a high‑accuracy, low‑cost solution for dynamic, real‑world applications such as autonomous driving, robotics, and surveillance.
Comments & Academic Discussion
Loading comments...
Leave a Comment