Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data-generating processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. We then propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
💡 Research Summary
The paper investigates the relationship between linear‑attention Transformers and vector autoregressive (VAR) models, focusing on time‑series forecasting (TSF). It first shows that a single linear‑attention layer can be mathematically reformulated as a dynamic VAR system. Starting from the standard linear‑attention equation
(o_t = \sum_{i=1}^{t} (q_t k_i^{\top})\, v_i),
the authors define the rank‑1 weight matrix (A_{t,i}=v_i^{\top} q_t) (an outer product under the row‑vector convention). Transposing the output then gives
(o_t^{\top}= \sum_{i=1}^{t} A_{t,i} k_i^{\top}),
which is exactly a VAR representation where the “observations’’ are the key vectors (k_i) and the coefficients are dynamically generated at each prediction step. This connection differs from prior views that treat linear attention as an RNN or fast‑weight programmer; here the emphasis is on the lag‑based, interpretable structure of VAR.
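This identity can be checked numerically. The sketch below (illustrative NumPy code, not the paper's implementation) confirms that summing the rank‑1 matrices (A_{t,i} = v_i^{\top} q_t) against the key vectors reproduces the standard linear‑attention output:

```python
import numpy as np

# Illustrative check: linear attention over a short sequence equals a
# dynamic VAR over the key vectors (row-vector convention, single head).
rng = np.random.default_rng(0)
T, d = 5, 4                      # sequence length and head dimension
Q = rng.normal(size=(T, d))      # one q_t per row
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

t = T - 1                        # evaluate at the last step

# Standard linear attention: o_t = sum_i (q_t k_i^T) v_i
o_attn = sum((Q[t] @ K[i]) * V[i] for i in range(t + 1))

# VAR view: o_t^T = sum_i A_{t,i} k_i^T with rank-1 A_{t,i} = v_i^T q_t
o_var = sum(np.outer(V[i], Q[t]) @ K[i] for i in range(t + 1))

assert np.allclose(o_attn, o_var)  # the two forms coincide
```

The two sums are term-by-term identical because the scalar attention score (q_t k_i^{\top}) commutes past (v_i), which is what makes the rank‑1 factorization of (A_{t,i}) exact rather than approximate.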
The authors then argue that modern multi‑layer decoder‑only Transformers are misaligned with the autoregressive forecasting objective. Three sources of mismatch are identified: (1) loss mis‑specification – a VAR model would require each layer to perform a forward shift, but the Transformer loss only enforces a single shift at the final output, forcing each layer to learn fractional shifts; (2) residual‑shortcut and pre‑normalization – the shortcut bypasses the original observation (k_i), while the attention operates on a layer‑normalized version, breaking the direct link between past observations and current predictions; (3) uneven weighting of past lags – deeper layers cause the representation of earlier tokens to drift away from the raw observations, biasing the effective VAR coefficients.
To resolve these issues, the paper proposes a structural re‑arrangement: the MLP and attention modules are reordered (or interleaved) so that attention receives the raw key space, and the residual shortcut shares the same normalization as the attention input. This design forces each layer to obey the same recursive equation as a VAR step, preserving balanced lag weights and enabling a clear “temporal influence path” that traverses up to (l-1) intermediate nodes in an (l)-layer stack.
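The exact re‑arrangement is architecture‑specific; the following is only a minimal single‑head sketch of the idea under stated assumptions (all names and the shared‑normalization placement are hypothetical): attention reads the raw key space directly, the dynamic coefficients come from one normalized branch, and the residual shortcut preserves the raw observations.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (token) to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def aligned_layer(K_raw, Wq, Wv):
    """Hypothetical structurally aligned layer (a sketch, not the paper's code):
    attention operates on the raw keys K_raw, while queries and values are
    produced from a single shared normalized branch."""
    H = layer_norm(K_raw)              # one shared normalization
    Q, V = H @ Wq, H @ Wv              # dynamic coefficients from H
    T, _ = K_raw.shape
    O = np.zeros_like(K_raw)
    for t in range(T):
        for i in range(t + 1):         # causal: current and past lags only
            O[t] += (Q[t] @ K_raw[i]) * V[i]   # attention over raw keys
    return K_raw + O                   # shortcut keeps the raw observation

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))
out = aligned_layer(K, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```

Because every layer updates the same raw key space, stacking (l) such layers composes VAR-style steps, which is what yields the temporal influence path through at most (l-1) intermediate nodes.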
Building on this aligned architecture, the authors introduce Structural Aligned Mixture of VAR (SAMoVAR). SAMoVAR retains the linear‑attention kernel but explicitly adds an identity matrix to the dynamic weight (A_{t,i}) to form a “key‑shortcut”:
(k_{t+1}=k_t + o_t + u_t = \sum_{i=1}^{t} (A_{t,i}+\mathbb{1}_{i=t} I)\, k_i + u_t).
Thus each layer implements a VAR(·) update with dynamically generated coefficients while preserving the original observation. The model remains (O(N)) in time and space because it still uses linear attention, yet it provides explicit, interpretable coefficient matrices for every lag.
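A minimal sketch of this recursion (with random stand-ins for the learned query/value projections and the input stream, so the names are illustrative only) shows the key-shortcut update and its equivalence to the VAR form with an identity on the current lag:

```python
import numpy as np

# Illustrative unrolling of the key-shortcut update, not the paper's code.
rng = np.random.default_rng(1)
T, d = 6, 3
Q = rng.normal(size=(T, d))            # stand-in dynamic queries
V = rng.normal(size=(T, d))            # stand-in dynamic values
U = rng.normal(size=(T, d))            # inputs u_t
K = np.zeros((T + 1, d))
K[0] = rng.normal(size=d)              # initial key state

for t in range(T):
    # o_t = sum_i A_{t,i} k_i with dynamic rank-1 coefficients A_{t,i} = v_i^T q_t
    o_t = sum(np.outer(V[i], Q[t]) @ K[i] for i in range(t + 1))
    K[t + 1] = K[t] + o_t + U[t]       # key shortcut: identity added at i = t

# The same step written as a VAR with an identity on the current lag:
t = T - 1
rhs = sum((np.outer(V[i], Q[t]) + (np.eye(d) if i == t else 0.0)) @ K[i]
          for i in range(t + 1)) + U[t]
assert np.allclose(K[t + 1], rhs)
```

The identity term is what preserves the raw observation across layers: without it, the update would only propagate the attention mixture (o_t), reintroducing the shortcut mismatch diagnosed above.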
Empirical evaluation is conducted on several benchmark multivariate time‑series datasets (e.g., Electricity, Traffic, Exchange‑Rate, Weather). SAMoVAR is compared against state‑of‑the‑art TSF models such as Informer, Autoformer, N‑HiTS, and LogTrans. Results show consistent improvements of 3–7 % in MSE/MAE, with especially large gains on datasets where cross‑variable dependencies are strong. Moreover, the learned VAR coefficient matrices align with known domain relationships, demonstrating interpretability. Parameter count and FLOPs are reduced by roughly 30 % relative to vanilla Transformer baselines, confirming computational efficiency.
In summary, the paper makes three key contributions: (1) a rigorous derivation linking linear attention to dynamic VAR models; (2) a diagnosis of structural mismatches in existing multi‑layer Transformers for autoregressive forecasting; (3) a redesigned architecture (SAMoVAR) that aligns the Transformer’s computation with VAR objectives, delivering superior accuracy, interpretability, and efficiency. The work opens avenues for further extensions such as nonlinear VAR extensions, automated hyper‑parameter tuning, and domain‑specific weight initialization, positioning SAMoVAR as a promising bridge between classical econometric time‑series modeling and modern deep learning architectures.