Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters per token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling along three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding the costly rolling-style inference and pronounced error accumulation of standard next-token prediction. To build a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores among pre-trained models. Timer-S1 will be released to facilitate further research.
💡 Research Summary
Timer‑S1 is a billion‑scale time‑series foundation model that pushes the limits of pre‑trained forecasting systems by introducing a "Serial Scaling" strategy across three dimensions: architecture, data, and training pipeline. Architecturally, the model combines a decoder‑only Transformer backbone with two novel block types. The TimeMoE block implements a Mixture‑of‑Experts (MoE) mechanism tailored to time‑series heterogeneity, containing 8.3B total parameters while activating only 0.75B per token, thereby achieving high capacity with a modest memory footprint. The second block, TimeSTP, realizes Serial‑Token Prediction (STP), a training objective that mirrors the inherently serial nature of forecasting. Instead of conventional next‑token or multi‑token prediction, which requires costly rolling inference and suffers from error accumulation, STP shifts the input series by one step and predicts the next token in a progressive, serial fashion. Crucially, TimeSTP blocks are retained after pre‑training, so the same serial computation graph is used at inference time, eliminating the need for iterative autoregressive loops.
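To make the sparse-activation idea concrete, the toy router below selects the top‑k experts for a single token and renormalizes their gate weights. This is an illustrative sketch of generic top‑k MoE routing, not the paper's actual TimeMoE implementation; all function names and the gating scheme are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k=2):
    """Select the k experts with the highest gate score for one token.

    Returns (expert_indices, renormalized_weights). Only the selected
    experts run, so most parameters stay inactive for this token.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return top, [probs[i] / mass for i in top]

# One token's gate logits over four experts: experts 1 and 3 fire.
indices, weights = route_top_k([0.2, 2.0, -1.0, 1.5], k=2)
```

Activating roughly 0.75B of 8.3B parameters corresponds to running only a small slice of the experts per token, which is how the model keeps per-token compute close to that of a much smaller dense model.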
On the data side, the authors curate TimeBench, a trillion‑scale corpus comprising over one trillion time points drawn from diverse domains such as IoT sensor streams, climate records, financial tick data, and healthcare monitoring. To mitigate domain bias, they apply systematic data augmentation techniques—resampling, value flipping, and controlled noise injection—ensuring balanced representation across frequencies, scales, and statistical properties. During pre‑training, each sample is treated as a set of forecasting tasks with variable input and output horizons, enabling the model to learn both short‑term dynamics and long‑term dependencies simultaneously.
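The summary names resampling, value flipping, and controlled noise injection but does not specify their exact form, so the minimal implementations below are illustrative assumptions about what such augmentations might look like on a univariate series:

```python
import random

def resample(series, factor):
    """Keep every `factor`-th point, simulating a lower sampling frequency."""
    return series[::factor]

def value_flip(series):
    """Mirror the series around its mean, inverting trends while keeping scale."""
    mean = sum(series) / len(series)
    return [2 * mean - x for x in series]

def jitter(series, scale=0.01, seed=0):
    """Inject small Gaussian noise in a controlled (seeded) way."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, scale) for x in series]

series = [1.0, 2.0, 3.0, 4.0]
low_freq = resample(series, 2)   # [1.0, 3.0]
flipped = value_flip(series)     # [4.0, 3.0, 2.0, 1.0]
noisy = jitter(series)           # close to the original, slightly perturbed
```

Each transform targets a different source of bias: resampling balances frequencies, flipping balances trend direction, and jitter balances noise levels across domains.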
The training pipeline is split into two stages. The first stage is massive unified pre‑training using a combination of mean‑squared error loss and a horizon‑weighted loss that emphasizes short‑horizon accuracy, which is essential for downstream tasks that often prioritize immediate forecasts. The second stage introduces Continued Pre‑Training (CPT) and Long‑Context Extension. CPT fine‑tunes the model on additional data with a weighted STP objective, sharpening short‑term performance without overwriting the general representations learned earlier. Long‑Context Extension leverages a re‑parameterized Rotary Positional Embedding (RoPE) to stretch the effective context window from 2,880 tokens to 11,520 tokens, allowing the model to ingest and reason over much longer historical windows while preserving positional information.
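One simple way to realize a horizon‑weighted objective is to decay the per-step weight exponentially with forecast distance. The exponential schedule below is a hypothetical choice for illustration; the summary only states that short horizons are weighted more heavily, not how.

```python
def horizon_weighted_mse(pred, target, decay=0.9):
    """MSE where forecast step h gets weight decay**h, so near-term
    errors dominate the loss. decay=1.0 recovers the plain MSE."""
    assert len(pred) == len(target) and pred
    num = sum((decay ** h) * (p - t) ** 2
              for h, (p, t) in enumerate(zip(pred, target)))
    den = sum(decay ** h for h in range(len(pred)))
    return num / den

plain = horizon_weighted_mse([1.0, 2.0], [0.0, 0.0], decay=1.0)   # 2.5
skewed = horizon_weighted_mse([1.0, 2.0], [0.0, 0.0], decay=0.5)  # 2.0
```

With decay below 1, the same squared errors yield a lower loss when the large error sits far in the future, which is exactly the short-horizon emphasis described above.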
Evaluation on the large‑scale GIFT‑Eval benchmark demonstrates that Timer‑S1 achieves state‑of‑the‑art results: a CRPS of 0.485 and a MASE of 0.693, both the best among pre‑trained models. The gains are especially pronounced for medium‑ and long‑term horizons (beyond 48 hours), where the serial computation of STP reduces error propagation compared to rolling autoregressive baselines. Scaling law analyses reveal a non‑linear relationship between model size, context length, and forecasting accuracy, with an optimal MoE routing ratio around 9% that balances computational efficiency and predictive power. Ablation studies confirm the necessity of each component: removing TimeSTP leads to a steep increase in long‑horizon error; omitting data augmentation causes domain‑specific performance drops of up to 5%; and training without the CPT stage degrades short‑term MASE from 0.693 to 0.78.
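For readers unfamiliar with the headline metric: MASE is the forecast's mean absolute error scaled by the in-sample error of a naive previous-value forecast, so values below 1.0 beat the naive baseline. A minimal non-seasonal computation (benchmarks such as GIFT-Eval typically use a seasonal variant of the naive scaler):

```python
def mase(forecast, actual, history):
    """Mean Absolute Scaled Error: the forecast's MAE divided by the
    in-sample MAE of the naive previous-value forecast on `history`."""
    mae = sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)
    naive = sum(abs(history[i] - history[i - 1])
                for i in range(1, len(history))) / (len(history) - 1)
    return mae / naive

score = mase(forecast=[5.0, 6.0], actual=[5.5, 6.5], history=[1.0, 2.0, 3.0, 4.0])
# naive in-sample MAE is 1.0, forecast MAE is 0.5, so score == 0.5
```

A MASE of 0.693 therefore means Timer‑S1's average absolute error is about 69% of the scaled naive baseline's.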
The paper also discusses limitations. MoE routing can cause GPU memory fragmentation, demanding sophisticated scheduling in large‑scale clusters. The current routing mechanism does not adapt dynamically to sudden regime shifts, which are common in real‑world series. Moreover, a uniform patch size is used across all domains, suggesting that domain‑specific patch engineering could further improve results. Future work aims to address these issues and to integrate Timer‑S1 into multimodal, agentic AI systems that combine text, images, and time‑series data for autonomous decision‑making. The authors commit to open‑sourcing the model and the TimeBench dataset to foster broader research in scalable, high‑fidelity time‑series forecasting.