Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting
Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Because scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that, while performant, are inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in Reverso, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.
💡 Research Summary
The paper “Reverso: Efficient Time Series Foundation Models for Zero‑shot Forecasting” challenges the prevailing belief that ever‑larger transformer‑based time‑series foundation models (TSFMs) are required for high‑quality zero‑shot forecasting. Instead, the authors propose a lightweight family of models—named Reverso—whose parameter counts range from 0.2 M to 2.6 M, yet achieve performance comparable to or better than state‑of‑the‑art TSFMs that contain hundreds of millions or even billions of parameters.
Core Architectural Idea
Reverso adopts a hybrid sequence‑mixing strategy that alternates between two inexpensive primitives: (1) a long convolution layer implemented via FFT‑based depth‑wise separable convolutions, and (2) a linear recurrent layer called DeltaNet. The long convolution uses a kernel whose length equals the full context (L = 2048), giving a computational complexity of O(d L log L) and allowing the model to capture very long‑range dependencies without quadratic cost. DeltaNet updates a hidden state with a query‑key‑value formulation and a learned gating scalar β, and it incorporates a “state‑weaving” trick where the last time step of the previous layer is added to the first hidden state of the current layer, thereby providing bidirectional context.
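To make the two primitives concrete, here is a minimal NumPy sketch written for this summary, not the paper's implementation: `fft_long_conv` applies a per-channel (depth-wise) kernel spanning the full context via zero-padded FFTs in O(d L log L), and `deltanet_scan` runs one common formulation of the delta-rule recurrence (the state-weaving detail and any normalization choices specific to Reverso are omitted).

```python
import numpy as np

def fft_long_conv(x, k):
    """Depth-wise long convolution via FFT, cost O(d * L log L).
    x: (L, d) inputs; k: (L, d) per-channel kernels spanning the full
    context. Zero-padding to 2L avoids circular wrap-around; keeping
    only the first L outputs makes the convolution causal."""
    L = x.shape[0]
    Xf = np.fft.rfft(x, n=2 * L, axis=0)
    Kf = np.fft.rfft(k, n=2 * L, axis=0)
    return np.fft.irfft(Xf * Kf, n=2 * L, axis=0)[:L]

def deltanet_scan(q, k, v, beta):
    """One common DeltaNet-style delta-rule recurrence (a sketch):
    S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T, with outputs
    o_t = S_t q_t. Shapes: q, k, v are (L, d); beta is (L,)."""
    L, d = q.shape
    S = np.zeros((d, d))
    out = np.empty((L, d))
    for t in range(L):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-8)      # unit-norm keys for stability
        S = S + beta[t] * np.outer(v[t] - S @ kt, kt)  # gated delta update
        out[t] = S @ q[t]
    return out
```

With β_t = 1 and orthonormal keys, the recurrence behaves like an exact key-value memory: querying a previously written key retrieves its value, which is why the gating scalar β controls how aggressively old associations are overwritten.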
Each sequence‑mixing block is followed by a simple channel‑mixing MLP that expands the hidden dimension by a factor of four and applies ReLU, mirroring the feed‑forward network of standard transformers but with far fewer parameters. The final decoder head projects the transformed sequence into a set of p = 48 query vectors, attends over keys and values derived from the same sequence, and then linearly maps the attention output to the forecast horizon. Positional embeddings are unnecessary for the smallest models, while a sinusoidal embedding yields a modest gain for the 2.6 M‑parameter variant.
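The channel mixer and the query-based decoder head are also straightforward to sketch. The weight names and shapes below are illustrative assumptions; only the 4× expansion, ReLU, and the p learned queries come from the text.

```python
import numpy as np

def channel_mlp(h, W1, W2):
    """Channel-mixing MLP: expand the hidden dim by 4x, apply ReLU,
    project back. h: (L, d); W1: (d, 4d); W2: (4d, d)."""
    return np.maximum(h @ W1, 0.0) @ W2

def decoder_head(h, Q, Wk, Wv, Wo):
    """Decoder sketch: p learned queries Q (p, d) attend over keys and
    values derived from the sequence h (L, d); Wo (d, m) then maps each
    attended vector to a slice of the forecast horizon."""
    K, V = h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])             # (p, L)
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                  # row-wise softmax
    return (a @ V) @ Wo                                # (p, m)
```

With p = 48 queries each mapped to m horizon steps, the head emits a 48 × m grid that can be flattened into the forecast.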
Training Data and Augmentation
All Reverso models are pretrained on the GiftEval dataset, which contains roughly 4.5 M time series and 230 B data points. Because GiftEval is heavily imbalanced across its constituent datasets, the authors enforce a uniform sampling policy: at most 100 k series are drawn from each source per epoch, and no more than 48 windows are extracted from any single series.
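The capped sampling policy can be sketched as follows; the function and corpus layout are hypothetical, with only the two caps (100 k series per source, 48 windows per series) taken from the text.

```python
import random

def build_epoch_index(corpus, series_cap=100_000, window_cap=48, seed=0):
    """Sketch of a capped uniform sampling policy (helper names are
    ours): corpus maps source name -> list of (series_id, num_windows).
    Each source contributes at most `series_cap` series per epoch, and
    each series contributes at most `window_cap` windows."""
    rng = random.Random(seed)
    index = []
    for source, series in corpus.items():
        picked = rng.sample(series, min(series_cap, len(series)))
        for sid, n_windows in picked:
            index.extend((source, sid, w) for w in range(min(window_cap, n_windows)))
    return index
```

The effect is to flatten the long-tailed source distribution so that a handful of very large datasets cannot dominate an epoch.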
To enrich the training distribution, a multi-stage augmentation pipeline is applied in the following order: down‑sampling, amplitude modulation, horizontal and vertical flips (temporal reversal and sign inversion, respectively), censoring (randomly masking subsequences), and mix‑up across the batch. Additionally, synthetic series are generated using the KernelSynth approach: Gaussian processes with randomly selected kernels from a curated bank are combined via random additive or multiplicative operations, and then overlaid with spike, trapezoidal, trend, seasonality, and irregularity components.
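The individual augmentations are simple array transforms. The helpers below are minimal sketches of the operations named above; the parameters are caller-supplied placeholders, not the paper's settings.

```python
import numpy as np

def downsample(x, factor):
    """Down-sampling: keep every `factor`-th point."""
    return x[::factor]

def modulate_amplitude(x, scale):
    """Amplitude modulation: rescale the series."""
    return x * scale

def flip_horizontal(x):
    """Horizontal flip: temporal reversal."""
    return x[::-1]

def flip_vertical(x):
    """Vertical flip: sign inversion."""
    return -x

def censor(x, start, length):
    """Censoring: mask a contiguous subsequence with NaN."""
    out = x.astype(float).copy()
    out[start:start + length] = np.nan
    return out

def mixup(x, y, lam):
    """Mix-up: convex combination of two same-length series in a batch."""
    return (1.0 - lam) * x + lam * y
```

Applied in the order given in the text (down-sampling, amplitude modulation, flips, censoring, mix-up), each transform preserves the series' general shape while diversifying scale, orientation, and missingness patterns.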
Experimental Findings
Three model sizes are evaluated: Reverso‑S (0.2 M parameters), Reverso‑M (0.8 M), and Reverso‑L (2.6 M). On the full GiftEval test set, Reverso‑L matches or exceeds the MAE of the best existing transformer‑based TSFMs (e.g., TimeGPT, Chronos‑2) while using less than 1 % of their FLOPs and memory. Even the smallest Reverso‑S outperforms large baselines by 5‑10 % in MAE, demonstrating that the hybrid convolution/DeltaNet backbone retains sufficient expressive power for zero‑shot forecasting. Inference speed is dramatically improved: the authors report over a 100‑fold reduction in latency compared with a 1 B‑parameter transformer, making Reverso suitable for real‑time or edge deployments.
Ablation studies confirm that (i) removing the long‑convolution component degrades performance sharply, (ii) replacing DeltaNet with more complex gated linear attention or GLU modules yields no measurable benefit, and (iii) the simple MLP channel mixer outperforms more sophisticated alternatives.
Significance and Limitations
Reverso establishes that careful architectural choices—leveraging FFT‑based convolutions for cheap long‑range mixing and linear RNNs for stateful updates—can close the performance gap to massive transformers while delivering orders‑of‑magnitude efficiency gains. This work therefore provides a practical blueprint for building cost‑effective TSFMs that can be trained on publicly available heterogeneous time‑series corpora and deployed in resource‑constrained environments. Limitations include a primary focus on univariate series; extending the approach to high‑dimensional multivariate forecasting and to irregularly sampled data remains an open research direction.
In summary, the paper delivers a compelling argument and empirical evidence that “bigger is not always better” for time‑series foundation models, and introduces Reverso as a lightweight yet powerful alternative for zero‑shot forecasting across diverse domains.