M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling


Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size. We also demonstrate how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M$^2$RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M$^2$RNN outperforms equivalent Gated DeltaNet hybrids by $0.4$-$0.5$ perplexity points on a 7B MoE model, while using $3\times$ smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M$^2$RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M$^2$RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M$^2$RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to $8$ points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.


💡 Research Summary

The paper revisits non‑linear recurrent neural networks (RNNs) as a means to overcome fundamental expressivity limits of modern transformer‑based language models. While transformers excel at parallel processing, their computations belong to the TC⁰ complexity class, which precludes solving tasks that require richer state‑tracking capabilities such as entity tracking, code execution, or permutation‑group reasoning. Linear RNNs and state‑space models (SSMs) improve on this by offering linear‑time recurrence, but when their transition matrices are input‑independent or diagonal they remain confined to TC⁰ (or, at best, NC¹). Consequently, they cannot reliably handle hard state‑tracking problems.

The authors propose Matrix‑to‑Matrix RNN (M²RNN), a non‑linear RNN architecture that replaces the traditional vector‑valued hidden state (hₜ∈ℝᵈ) with a matrix‑valued state Hₜ∈ℝ^{K×V}. The core of M²RNN is an outer‑product based state expansion: given an input token xₜ, a key embedding φ(kₜ)∈ℝ^{K} and a value vector vₜ∈ℝ^{V} are computed, and the state is updated as ΔHₜ = φ(kₜ)·vₜᵀ, Hₜ = H_{t‑1} + ΔHₜ. This operation is inherently non‑linear (through φ) yet fully compatible with GPU tensor‑core primitives, allowing high‑throughput matrix multiplications. A forget gate, independent of the recurrent state, is introduced to mitigate vanishing gradients while preserving parallelizability across time steps.
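The update rule can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's exact parameterization: the choice of φ as an elementwise tanh and of the forget gate fₜ as a single scalar are assumptions made here for concreteness.

```python
import math

def phi(k):
    # Illustrative non-linearity; the paper's exact feature map may differ.
    return [math.tanh(x) for x in k]

def m2rnn_step(H, k_t, v_t, f_t):
    """One recurrent step: H_t = f_t * H_{t-1} + phi(k_t) v_t^T.

    H   : K x V state matrix (list of K rows of length V)
    k_t : length-K key, v_t : length-V value
    f_t : scalar forget gate in (0, 1), independent of the recurrent
          state, which keeps the gate parallelizable across time steps
          while mitigating vanishing gradients
    """
    pk = phi(k_t)
    return [[f_t * H[i][j] + pk[i] * v_t[j] for j in range(len(v_t))]
            for i in range(len(pk))]
```

When steps are batched, the rank-1 outer product φ(kₜ)vₜᵀ is exactly the shape of work that maps onto tensor-core matrix-multiply primitives.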

Theoretical analysis shows that M²RNN can simulate deterministic finite‑state automata (DFAs) and solve NC¹‑complete problems such as the word problem of the permutation group S₅, thereby surpassing the expressive power of both transformers and linear RNNs. Empirically, M²RNN achieves perfect accuracy on synthetic state‑tracking benchmarks that are provably unsolvable by TC⁰ models.
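To make the benchmark concrete, here is a minimal sketch of the S₅ state‑tracking task itself (the task, not the model): each token is a permutation of five elements, and the answer at each position is the running composition. Maintaining this state is NC¹‑complete, which is why TC⁰ models cannot length‑generalize on it.

```python
def compose(p, q):
    # Apply p first, then q: (q . p)(i) = q[p[i]].
    return tuple(q[i] for i in p)

def s5_state(seq):
    """Fold a sequence of S5 permutations into the resulting group element."""
    state = tuple(range(5))  # identity permutation
    for p in seq:
        state = compose(state, p)
    return state
```

For example, applying the same transposition twice, or a 5‑cycle five times, returns the identity; a model that truly tracks state must reproduce such invariants at any sequence length.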

A major contribution is the demonstration that the performance gap between non‑linear RNNs and transformers is largely due to state size rather than non‑linearity per se. By expanding the hidden state to a matrix, M²RNN attains a capacity comparable to linear RNNs (which already use K×V matrices) while keeping the total parameter count fixed. Experiments varying K and V show substantial perplexity reductions as the matrix dimensions grow, confirming that larger state sizes are critical for strong language modeling.
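A back‑of‑the‑envelope comparison (with hypothetical dimensions, not the paper's configuration) illustrates the capacity gap between the two state shapes:

```python
def state_elems_vector(d):
    # Classical RNN: hidden vector h_t in R^d.
    return d

def state_elems_matrix(K, V):
    # M^2RNN / linear-attention style: hidden matrix H_t in R^{K x V}.
    return K * V

# Expanding a 256-dim vector state to a 256 x 256 matrix state grows
# state capacity 256x, while the per-step key/value projections that
# feed the outer-product update stay only O(K + V) wide.
ratio = state_elems_matrix(256, 256) // state_elems_vector(256)
```

The per‑step parameters scale with K + V, so the state can be enlarged without a proportional growth in parameter count, consistent with the paper's fixed‑parameter experiments.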

Hardware‑level optimizations are a central focus. The outer‑product update aligns perfectly with NVIDIA tensor‑core 16×16 matrix kernels, eliminating the padding overhead that plagues prior “FlashRNN” approaches. The authors implement forward and backward kernels in Triton, achieving high on‑chip reuse and reducing HBM traffic. Two tensor‑parallel (TP) strategies are described: a topology‑aware method that requires no extra communication, and a topology‑agnostic method that preserves parameter counts across TP configurations at the cost of synchronization. Both enable scaling M²RNN to multi‑GPU clusters.
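The padding argument can be quantified with a small illustrative calculation (the helper below is ours, not the paper's): if a recurrence's matrices must be padded up to the 16×16 tile size, dimensions that are not tile multiples waste a fraction of every tensor‑core operation, whereas choosing K and V as multiples of 16 makes the outer‑product update tile‑exact.

```python
import math

def pad_to_tile(dim, tile=16):
    # Round a matrix dimension up to the tensor-core tile size.
    return tile * math.ceil(dim / tile)

def padding_overhead(rows, cols, tile=16):
    """Fraction of the tile-padded matrix occupied by wasted padding."""
    full = pad_to_tile(rows, tile) * pad_to_tile(cols, tile)
    return 1.0 - (rows * cols) / full
```

For instance, a 100×100 matrix pads to 112×112 (roughly 20% wasted work), while a 128×128 state incurs no padding at all.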

In large‑scale language modeling, M²RNN is evaluated both as a standalone recurrent layer and within hybrid architectures that interleave recurrent layers with causal attention. On a 7‑billion‑parameter Mixture‑of‑Experts (MoE) model, Hybrid M²RNN outperforms an equivalent Gated DeltaNet hybrid by 0.4–0.5 perplexity points, despite using three‑times‑smaller recurrent state sizes. Remarkably, replacing even a single recurrent layer with M²RNN in an existing hybrid model delivers gains comparable to the full Hybrid M²RNN stack, with negligible impact on training throughput.

Long‑context generalization is assessed on the LongBench benchmark. Hybrid models that combine M²RNN with either Mamba‑2 or Gated DeltaNet achieve up to an 8‑point boost over state‑of‑the‑art hybrid linear‑attention models at both 410 M dense and 7 B MoE scales. This improvement is attributed to the larger, matrix‑valued state that better preserves key‑value associations over thousands of tokens.

The paper also discusses limitations: M²RNN still requires sequential time‑step computation, which can be a bottleneck for extremely long pre‑training sequences; memory consumption grows with K and V, necessitating careful TP configuration; and current optimizations target NVIDIA GPUs, so portability to other hardware may need additional work.

In conclusion, M²RNN demonstrates that non‑linear RNNs, when equipped with matrix‑valued states and outer‑product state expansion, can close the expressivity gap with transformers while leveraging modern GPU hardware for efficient training. The architecture offers a compelling building block for future large‑scale language models that need both high‑throughput training and the ability to perform complex, long‑range stateful computations beyond the TC⁰ barrier.

