StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation
Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm effectively approximates learning on infinitely long sequences, enabling the model to achieve long-horizon memory retention. Experiments across both MAZE and ProcTHOR environments demonstrate that StateLinFormer significantly outperforms its stateless linear-attention counterpart and standard Transformer baselines with fixed context windows. Notably, as interaction length increases, persistent stateful training substantially improves context-dependent adaptation, suggesting an enhancement in the model’s In-Context Learning (ICL) capabilities for navigation tasks.
💡 Research Summary
The paper “StateLinFormer: Stateful Training Enhancing Long‑term Memory in Navigation” addresses a fundamental limitation of current embodied navigation models: the inability to retain and exploit information over very long interaction horizons. Traditional modular pipelines such as SLAM provide explicit maps but lack flexibility, while recent end‑to‑end Transformer‑based agents encode rich semantic priors but are constrained by fixed context windows and by a training protocol that resets internal states at every batch boundary. Consequently, these models cannot accumulate persistent memory across extended deployments, leading to redundant exploration and poor adaptation to evolving environments.
To overcome this gap, the authors propose two intertwined ideas. First, they adopt a linear‑attention architecture (often called a kernelized or Performer‑style attention) that maintains a fixed‑size memory matrix Mₜ updated incrementally as Mₜ = Mₜ₋₁ + φ(kₜ)vₜᵀ, where φ(·) is a feature‑mapping function. This design yields constant O(1) memory and computational cost per time step, making it scalable to arbitrarily long sequences. Second, and most importantly, they introduce a “stateful training” paradigm. In conventional training, the memory is re‑initialized to zero for each mini‑batch, so the model only ever sees zero‑state contexts during optimization. In stateful training, the final memory state of batch b is carried over as the initial state of batch b + 1 (Mᵇ_T → Mᵇ⁺¹_0). The loss is still computed per batch, but gradients are truncated within the batch for tractability. This simple change aligns the distribution of training contexts with the distribution encountered at deployment: the model is optimized under a stationary distribution of memory states induced by its own recurrent dynamics rather than under a degenerate zero‑state distribution.
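The two ideas above can be combined in a short sketch. This is a minimal NumPy illustration, not the paper's released code: `phi` uses a common ELU+1 feature map as an assumed choice, and each "segment" stands in for one mini-batch whose final memory state seeds the next one.

```python
import numpy as np

def phi(x):
    # Hypothetical positive feature map (ELU + 1), a common choice for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_step(M, z, k, v, q):
    """One recurrent update: M_t = M_{t-1} + phi(k_t) v_t^T, plus a running normalizer z."""
    fk = phi(k)
    M = M + np.outer(fk, v)            # accumulate key-value associations in fixed-size memory
    z = z + fk                         # normalizer so readouts stay scale-stable
    fq = phi(q)
    out = (fq @ M) / (fq @ z + 1e-6)   # O(1) readout per step, independent of sequence length
    return M, z, out

def run_segment(M, z, keys, values, queries):
    """Process one training segment, carrying memory state in and out."""
    outs = []
    for k, v, q in zip(keys, values, queries):
        M, z, o = linear_attention_step(M, z, k, v, q)
        outs.append(o)
    return M, z, np.stack(outs)

d = 4
rng = np.random.default_rng(0)
M, z = np.zeros((d, d)), np.zeros(d)        # stateful: initialized once, not per batch
for batch in range(3):                       # consecutive segments of one long trajectory
    K, V, Q = (rng.normal(size=(8, d)) for _ in range(3))
    M, z, outs = run_segment(M, z, K, V, Q)
    # gradients would be truncated here; M and z persist into the next segment
```

In a stateless setup, the line resetting `M` and `z` would sit inside the batch loop; moving it outside is the entire protocol change.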
The authors formalize this shift in the optimization objective. Stateless training solves minθ Eτ∼D L(θ; M₀ = 0), whereas stateful training approximates minθ Eτ∼D, M∼dθ L(θ; M), where dθ is the memory‑state distribution induced by the model's recurrent dynamics over long trajectories. Under mild ergodicity assumptions, dθ converges to a stationary distribution, ensuring that the model experiences a diverse set of memory contexts during learning.
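In display notation, the contrast between the two objectives reads as follows (a transcription of the in-text formulas above, with the same symbols):

```latex
% Stateless training: memory re-initialized to zero for every mini-batch
\min_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D}} \,
    \mathcal{L}\bigl(\theta;\, M_0 = 0\bigr)

% Stateful training: memory drawn from the model-induced distribution d_\theta
\min_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D},\; M \sim d_\theta} \,
    \mathcal{L}\bigl(\theta;\, M\bigr)
```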
Methodologically, the model follows a modular encoder‑decoder pipeline inspired by the SPOC framework. Textual instructions, navigation‑camera images, and manipulation‑camera images are encoded separately (goal encoder, image encoder, visual encoder) and fused into a unified observation vector oₜ. The decoder receives the concatenation of oₜ and the previous action aₜ₋₁, together with the persistent memory Mₜ₋₁, and produces a hidden state hₜ and an updated memory Mₜ. The action distribution is obtained via a softmax over an MLP applied to hₜ. The only architectural novelty lies in the handling of Mₜ during training, as described above.
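The decoder step described above can be sketched as follows. All names and dimensions here are illustrative assumptions (not taken from the paper's code): a single fused observation vector and previous action go in, a persistent memory matrix is updated in place, and a softmax over an output projection yields the action distribution.

```python
import numpy as np

class NavDecoderSketch:
    """Hypothetical sketch of one decoder step: fuse (o_t, a_{t-1}), update
    persistent memory M, and emit an action distribution. Dimensions are
    illustrative, not from the paper."""

    def __init__(self, d_obs, d_act, d_hid, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        d_in = d_obs + d_act
        self.W_k = rng.normal(scale=0.1, size=(d_in, d_hid))
        self.W_v = rng.normal(scale=0.1, size=(d_in, d_hid))
        self.W_q = rng.normal(scale=0.1, size=(d_in, d_hid))
        self.W_out = rng.normal(scale=0.1, size=(d_hid, n_actions))

    def step(self, M, o_t, a_prev):
        x = np.concatenate([o_t, a_prev])       # fuse observation and previous action
        k, v, q = x @ self.W_k, x @ self.W_v, x @ self.W_q
        fk = np.maximum(k, 0.0) + 1.0           # simple positive feature map (assumed)
        M = M + np.outer(fk, v)                 # persistent memory update M_t
        fq = np.maximum(q, 0.0) + 1.0
        h = (fq @ M) / (fq.sum() + 1e-6)        # hidden state h_t read out from memory
        logits = h @ self.W_out                 # MLP head reduced to one layer for brevity
        probs = np.exp(logits - logits.max())
        return M, probs / probs.sum()           # softmax action distribution

dec = NavDecoderSketch(d_obs=6, d_act=4, d_hid=5, n_actions=4)
M = np.zeros((5, 5))                            # carried across steps (and batches)
o_t, a_prev = np.full(6, 0.1), np.zeros(4)
M, action_probs = dec.step(M, o_t, a_prev)
```

At deployment, `M` is simply never reset between calls to `step`, mirroring the stateful training protocol.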
Experimental evaluation is conducted on two procedurally generated indoor domains: a 15 × 15 grid‑based Maze (pixel‑style semantic observations) and ProcTHOR, a photorealistic 3D simulator built on AI2‑THOR with high‑resolution egocentric RGB inputs. To stress long‑term adaptation, the authors introduce a new benchmark called Continual Object Navigation (CON). CON stitches together many object‑goal episodes within the same environment, removes explicit goal specifications (the next goal is revealed only after completing the current one), and allows goal repetitions. This setting mimics a household robot that remains in a single room while gradually learning its layout and object locations.
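The CON episode-stitching protocol can be summarized in a few lines. This is a hedged sketch of the benchmark logic as described above, with invented names (`goal_pool`, `continual_object_nav`): goals are sampled with replacement from one environment, and each goal is revealed only after the previous one is completed.

```python
import random

def continual_object_nav(goal_pool, n_goals, seed=0):
    """Hypothetical sketch of the CON protocol: many object-goal episodes
    stitched together in the same environment, repetitions allowed."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_goals):
        goal = rng.choice(goal_pool)   # sampled with replacement: repeats possible
        # ...agent navigates until the current goal is reached...
        history.append(goal)
        # ...and only then is the next goal revealed (no upfront goal list)
    return history

session = continual_object_nav(["mug", "sofa", "lamp"], n_goals=10)
```

Because repeated goals recur within one session, an agent with persistent memory can amortize earlier exploration, which is exactly what the goal-repetition efficiency metric measures.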
Results show that StateLinFormer consistently outperforms three baselines: (1) a stateless linear‑attention model with identical architecture and parameter count, (2) a standard Transformer with a fixed context window, and (3) other memory‑augmented approaches such as ReLIC and Memo that still reset memory between trajectories. Metrics include success rate, SPL (Success weighted by Path Length), and goal‑repetition efficiency. Across both Maze and ProcTHOR, StateLinFormer achieves improvements of roughly 12‑18 % in success rate and 15 % in SPL over the stateless counterpart. Importantly, as the interaction horizon in CON grows, the performance gap widens, indicating that the persistent memory not only stores past observations but also facilitates emergent in‑context learning (ICL): the model adapts to new goals without any parameter updates, simply by leveraging the accumulated memory.
Ablation studies explore the impact of batch length, the choice of feature mapping φ(·), and the effect of fully resetting memory. Shorter batches diminish the benefit of stateful training, while complete resets erase the ICL advantage. Using learned kernels instead of random feature maps preserves the core gains, suggesting that the stateful protocol is the primary driver. Visualization of the memory matrix reveals that early exploration information (e.g., discovered room layouts) persists and is reused when new goals appear, confirming the intended long‑term retention.
In summary, StateLinFormer demonstrates that aligning the training protocol with the continual nature of embodied deployment—by preserving memory across batches—enables linear‑attention models to achieve genuine long‑term memory and in‑context adaptation while retaining the computational efficiency of O(1) attention. The work opens avenues for applying stateful training to other sequential domains (e.g., language modeling, video understanding) and for integrating external persistent storage or multi‑agent coordination mechanisms in future research.