AltTS: A Dual-Path Framework with Alternating Optimization for Multivariate Time Series Forecasting
Multivariate time series forecasting involves two qualitatively distinct factors: (i) stable within-series autoregressive (AR) dynamics, and (ii) intermittent cross-dimension interactions that can become spurious over long horizons. We argue that fitting a single model to capture both effects creates an optimization conflict: the high-variance updates needed for cross-dimension modeling can corrupt the gradients that support autoregression, resulting in brittle training and degraded long-horizon accuracy. To address this, we propose AltTS, a dual-path framework that explicitly decouples autoregression from cross-relation (CR) modeling. In AltTS, the AR path is instantiated with a linear predictor, while the CR path uses a Transformer equipped with Cross-Relation Self-Attention (CRSA); the two branches are coordinated via alternating optimization, which isolates gradient noise and reduces cross-block interference. Extensive experiments on multiple benchmarks show that AltTS consistently outperforms prior methods, with the most pronounced improvements on long-horizon forecasting. Overall, our results suggest that carefully designed optimization strategies, rather than ever more complex architectures, can be a key driver of progress in multivariate time series forecasting.
💡 Research Summary
The paper addresses a fundamental tension in multivariate time‑series forecasting: the coexistence of (i) stable, series‑specific autoregressive (AR) dynamics and (ii) intermittent, often noisy cross‑dimension (CR) interactions that become especially problematic at long horizons. Existing approaches typically employ a single model to capture both phenomena, which the authors argue leads to an optimization conflict. High‑variance gradient updates required for learning CR patterns inject noise into the AR component, destabilizing training and degrading long‑range accuracy.
To resolve this, the authors propose AltTS, a dual‑path architecture that explicitly separates AR and CR modeling and coordinates them through alternating optimization. The AR path is instantiated as a simple linear predictor applied independently to each variable after reversible instance normalization (RevIN). This path mirrors recent linear baselines (e.g., RLinear) but is deliberately isolated from any cross‑variable information. The CR path consists of an inverted Transformer encoder whose multi‑head self‑attention is modified into Cross‑Relation Self‑Attention (CRSA). In CRSA, the diagonal of the attention matrix is masked with −∞, forcing each head to attend only to other variables and thereby preventing the CR module from duplicating the AR function.
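The diagonal masking in CRSA can be illustrated with a minimal NumPy sketch (not the authors' implementation; function and variable names here are illustrative). Tokens are variables, as in an inverted Transformer, so the attention matrix is variate-to-variate; setting its diagonal to −∞ before the softmax drives every self-attention weight to exactly zero, forcing each variable to attend only to others:

```python
import numpy as np

def crsa_attention(scores: np.ndarray) -> np.ndarray:
    """Apply a CRSA-style diagonal mask to raw attention logits.

    scores: (N, N) matrix of variate-to-variate attention logits.
    Masking the diagonal with -inf makes the post-softmax weight of
    each variable on itself exactly zero, so the CR path cannot
    duplicate the (per-variable) AR function.
    """
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)
    # numerically stable row-wise softmax
    masked -= masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

# toy example with 3 variables
weights = crsa_attention(np.random.randn(3, 3))
print(np.allclose(np.diag(weights), 0.0))    # diagonal weights are zero
print(np.allclose(weights.sum(axis=-1), 1))  # rows still normalize to 1
```

In a real multi-head implementation the same mask would be broadcast across heads and applied to the scaled query-key scores before the softmax.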
Training proceeds by alternating updates: separate optimizers (e.g., Adam) with distinct learning rates are assigned to the AR and CR parameters. Within each epoch the algorithm performs a fixed number of AR updates followed by a fixed number of CR updates, effectively decoupling the stochastic gradients of the two blocks. The authors formalize the “gradient entanglement” problem: when only the aggregate residual r_i = Σ_j r_ij is observable, the gradient for any individual projection f_ij depends on the full residual, causing interference between AR (diagonal) and CR (off‑diagonal) parameters. Alternating optimization eliminates this interference by ensuring each block receives gradients derived from its own residual component.
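The alternating schedule can be sketched on a toy convex problem (a sketch under assumed simplifications, not the authors' training code): fit a mixing matrix split into a diagonal block D (AR) and an off-diagonal block C (CR), where each block is updated with its own learning rate while the other is frozen, so each block's gradient is computed against the shared residual but touches only its own entries:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 512                                  # variables, samples

# toy ground truth: strong diagonal (AR) + weak off-diagonal (CR) mixing
W_true = (np.diag(rng.uniform(0.5, 1.0, N))
          + 0.1 * rng.standard_normal((N, N)) * (1 - np.eye(N)))
X = rng.standard_normal((T, N))
Y = X @ W_true.T + 0.01 * rng.standard_normal((T, N))

D = np.zeros((N, N))                           # AR block (diagonal entries)
C = np.zeros((N, N))                           # CR block (off-diagonal entries)
lr_ar, lr_cr = 0.05, 0.02                      # distinct learning rates
diag, offdiag = np.eye(N, dtype=bool), ~np.eye(N, dtype=bool)

for epoch in range(300):
    # --- AR phase: a fixed number of updates to the diagonal block ---
    for _ in range(2):
        R = X @ (D + C).T - Y                  # aggregate residual r_i = sum_j r_ij
        G = (R.T @ X) / T                      # gradient of mean-squared loss / 2
        D[diag] -= lr_ar * G[diag]             # only AR entries move
    # --- CR phase: updates to the off-diagonal block, AR frozen ---
    R = X @ (D + C).T - Y
    G = (R.T @ X) / T
    C[offdiag] -= lr_cr * G[offdiag]           # only CR entries move

print(np.abs((D + C) - W_true).max())          # small once training converges
```

The update counts (2 AR steps per 1 CR step) and learning rates are illustrative; the point is that each phase's gradient step touches only its own parameter block, which is the decoupling the paper attributes to alternating optimization.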
Empirical evaluation spans seven public benchmarks (Weather, Traffic, Electricity, ETTh1/2, ETTm1/2) and multiple prediction lengths (96, 192, 336, 720). AltTS consistently outperforms strong baselines—including recent Transformer‑based models (PatchTST, Crossformer, iTransformer) and strong non‑Transformer methods (OLinear, TimesNet)—with relative MSE/MAE improvements ranging from 3% to 12%. The gains are most pronounced for the longest horizon (720 steps), confirming the method’s suitability for long‑term forecasting.
Ablation studies validate each design choice. Removing alternating optimization (joint training) dramatically inflates the gradient variance of the CR block and degrades performance. Omitting the diagonal mask in CRSA allows the CR path to learn redundant AR information, again harming stability. Replacing the linear AR predictor with a small MLP yields negligible accuracy gains while increasing computational cost, suggesting that a linear AR model is sufficient when properly isolated. Gradient‑variance plots corroborate the authors’ claim: under joint training the CR variance spikes across datasets, whereas alternating training yields a smooth, monotonic decline for both paths.
The paper’s broader contribution is a methodological shift: rather than pursuing ever more intricate architectures, careful alignment of optimization schedules with the intrinsic properties of the data can deliver substantial performance gains. The authors discuss limitations and future directions, including adaptive scheduling (e.g., learning the number of AR vs. CR updates), hybrid AR modules that incorporate lightweight non‑linearities, and applications to other domains such as finance or healthcare where cross‑variable dynamics are critical.
In summary, AltTS demonstrates that decoupling autoregressive and cross‑relation modeling, combined with a principled alternating optimization scheme, can achieve state‑of‑the‑art multivariate time‑series forecasting performance with a comparatively simple architecture. This work underscores the importance of optimization‑aware model design and opens avenues for further research on dynamic training schedules and domain‑specific extensions.