Decision MetaMamba: Enhancing Selective SSM in Offline RL with Heterogeneous Sequence Mixing
Mamba-based models have drawn much attention in offline RL. However, their selective mechanism is often detrimental when key steps in RL sequences are omitted. To address this issue, we propose a simple yet effective structure, called Decision MetaMamba (DMM), which replaces Mamba’s token mixer with a dense layer-based sequence mixer and modifies the positional structure to preserve local information. By performing sequence mixing that considers all channels simultaneously before Mamba, DMM prevents information loss due to selective scanning and residual gating. Extensive experiments demonstrate that DMM delivers state-of-the-art performance across diverse RL tasks. Furthermore, DMM achieves these results with a compact parameter footprint, demonstrating strong potential for real-world applications. Code is available at https://github.com/too-z/decision-metamamba
💡 Research Summary
Decision MetaMamba (DMM) addresses a critical limitation of recent Mamba‑based offline reinforcement learning (RL) models: the selective scanning and gating mechanisms can suppress essential information when key tokens—states, actions, or return‑to‑go (rtg)—receive near‑zero activations. This information loss is especially harmful in offline RL, where the model must infer optimal actions from static trajectories without environment interaction. The authors first visualize activation heatmaps showing that Mamba’s selective SSM often down‑weights state and rtg components, confirming the problem.
To remedy this, DMM introduces two complementary components. The first is a Dense Sequence Mixer (DSM), which replaces Mamba’s depth‑wise 1‑D convolution. DSM takes a local window of k consecutive tokens, flattens all channels, and applies a dense linear projection. By mixing all channels simultaneously, DSM captures short‑range dependencies more effectively than the original per‑channel convolution, aligning with the Markov property of RL dynamics. The second component is a modified Mamba block that no longer contains the depth‑wise convolution; instead, the DSM is placed at the front of the block. The processing pipeline is: input → LayerNorm → DSM → residual addition → LayerNorm → ModifiedMamba → final residual addition. This ordering ensures that local relationships are reinforced before the selective SSM operates, while the residual connections preserve any information that might be attenuated by gating. Because Mamba’s state‑space formulation already encodes positional information, DMM does not require extra positional encodings, further reducing overhead.
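The DSM idea above, flattening a local window of k consecutive tokens across all channels and applying a single dense projection, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the causal left-padded window, the function name, and the weight layout are all assumptions made for clarity.

```python
import numpy as np

def dense_sequence_mixer(x, W, b, k=3):
    """Hypothetical sketch of a Dense Sequence Mixer (DSM).

    x : (T, C) token sequence (time steps x channels)
    W : (k * C, C) dense projection weights mixing all channels at once
    b : (C,) bias
    Returns a (T, C) sequence; each output token mixes the k most recent
    tokens (causal, left-padded with zeros), unlike a depth-wise 1-D
    convolution, which would mix each channel independently.
    """
    T, C = x.shape
    padded = np.concatenate([np.zeros((k - 1, C)), x], axis=0)
    out = np.empty_like(x)
    for t in range(T):
        # Flatten the k-token window across ALL channels before projecting.
        window = padded[t : t + k].reshape(-1)
        out[t] = window @ W + b
    return out

rng = np.random.default_rng(0)
T, C, k = 6, 4, 3
x = rng.standard_normal((T, C))
W = rng.standard_normal((k * C, C)) * 0.1
b = np.zeros(C)
y = dense_sequence_mixer(x, W, b, k)
print(y.shape)  # (6, 4)
```

Because the projection sees the full k × C window at once, short-range cross-channel dependencies (e.g., between a state token and the adjacent rtg token) can be captured in a single layer, which a per-channel convolution cannot do.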
Extensive experiments were conducted on the D4RL benchmark, covering dense‑reward MuJoCo tasks (Hopper, Walker2d, HalfCheetah) and sparse‑reward tasks (AntMaze and Franka Kitchen). In dense‑reward settings, DMM matches or exceeds the best prior methods, achieving the highest average rank across Hopper, Walker2d, and HalfCheetah. In sparse‑reward environments, DMM shows dramatic gains: 91.0 vs. 68.0 on AntMaze‑um, 94.0 vs. 62.0 on AntMaze‑ud, and 83.0 vs. 59.3 on the mixed Kitchen subset, outperforming all recent Transformer‑ and SSM‑based baselines (DT, QLDT, EDT, DC, DS4, DM). Notably, DMM attains these results with a substantially smaller parameter count—approximately 30‑40 % fewer than DS4 and DM—while retaining Mamba’s linear‑time complexity, making it suitable for edge devices and real‑time robotic platforms.
Ablation studies confirm that both DSM and the modified Mamba are essential: removing DSM or using the original Mamba degrades performance markedly. Gradient‑norm analysis demonstrates that DMM maintains non‑vanishing gradients even for tokens far from the current step, indicating robust long‑range learning.
In summary, Decision MetaMamba offers a simple yet powerful redesign of Mamba for offline RL: a dense local mixer that captures short‑range dynamics, a front‑loaded placement preserving information before selective scanning, and residual pathways that mitigate gating‑induced loss. The resulting architecture delivers state‑of‑the‑art performance across a variety of offline RL benchmarks while being more parameter‑efficient, highlighting its promise for practical deployment in resource‑constrained robotic systems.