T-LLM: Teaching Large Language Models to Forecast Time Series via Temporal Distillation
Time series forecasting plays a critical role in decision-making across many real-world applications. Unlike data in vision and language domains, time series data is inherently tied to the evolution of underlying processes and can only accumulate as real-world time progresses, limiting the effectiveness of scale-driven pretraining alone. This time-bound constraint poses a challenge for enabling large language models (LLMs) to acquire forecasting capability, as existing approaches primarily rely on representation-level alignment or inference-time temporal modules rather than explicitly teaching forecasting behavior to the LLM. We propose T-LLM, a temporal distillation framework that equips general-purpose LLMs with time series forecasting capability by transferring predictive behavior from a lightweight temporal teacher during training. The teacher combines trend modeling and frequency-domain analysis to provide structured temporal supervision, and is removed entirely at inference, leaving the LLM as the sole forecasting model. Experiments on benchmark datasets and infectious disease forecasting tasks demonstrate that T-LLM consistently outperforms existing LLM-based forecasting methods under full-shot, few-shot, and zero-shot settings, while enabling a simple and efficient deployment pipeline.
💡 Research Summary
The paper introduces T‑LLM, a novel framework that endows general‑purpose large language models (LLMs) with time‑series forecasting capability through a process the authors call “temporal distillation.” Unlike prior work that either aligns representations between time‑series data and language models or adds auxiliary temporal modules at inference time, T‑LLM trains the LLM to directly imitate the predictions of a lightweight, purpose‑built temporal teacher. The teacher is removed after training, leaving the LLM as the sole forecasting engine, which simplifies deployment and eliminates any extra computational overhead at inference.
Motivation. Time‑series data are intrinsically bound to real‑world time: new observations appear only as the underlying process evolves. Consequently, massive pre‑training on historical series is costly and fundamentally limited by the “future wall” – you cannot pre‑train on data that does not yet exist. This time‑bound constraint makes it difficult for LLMs, which excel when trained on static corpora, to acquire genuine forecasting skill.
Temporal Teacher Design. The teacher combines two complementary modules:
- Trend Modeling – Inspired by DLinear, the input representation is decomposed into a coarse trend and a residual seasonal component using a moving‑average filter. Each component is linearly projected, and the results are summed, providing a stable linear forecast that captures long‑term drift.
- Frequency Modeling – An Adaptive Spectral Block (ASB) performs FFT‑based analysis on the high‑dimensional latent representation. To avoid redundancy, a Dominant Spectral Projection (DSP) compresses the spectrum to a small set of dominant frequencies. Crucially, the number of retained frequencies is horizon‑conditioned: a predefined mapping from prediction horizon T to a reduced spectral dimension d_red selects the most appropriate capacity for each forecasting length. This dynamic capacity regularization prevents over‑parameterization for short horizons while preserving sufficient periodic information for long horizons.
The teacher outputs a forecast Ŷ_Teacher that reflects both linear trend and dominant periodicities.
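To make the two branches concrete, here is a minimal numpy sketch of a teacher of this shape: a moving‑average trend decomposition with linear extrapolation, plus a horizon‑conditioned truncation of the residual's spectrum. The function name, kernel size, and the T → d_red mapping are all illustrative assumptions, not the paper's actual values.

```python
import numpy as np

def teacher_forecast(x, horizon, kernel=5, d_red_map=None):
    """Illustrative temporal-teacher sketch (names and constants are hypothetical).

    x: 1-D input series of length L. Returns a forecast of length `horizon`
    that combines linear trend extrapolation with the dominant periodic
    components kept by a horizon-conditioned spectral truncation (DSP idea).
    """
    L = len(x)
    # --- Trend branch (DLinear-style): moving-average trend + seasonal residual ---
    pad = kernel // 2
    padded = np.pad(x, (pad, pad), mode="edge")
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    # Extrapolate the trend with a least-squares line fit.
    t = np.arange(L)
    slope, intercept = np.polyfit(t, trend, 1)
    future_t = np.arange(L, L + horizon)
    trend_fc = slope * future_t + intercept
    # --- Frequency branch: keep only d_red dominant frequencies of the residual ---
    d_red_map = d_red_map or {24: 4, 96: 8, 192: 12}  # hypothetical T -> d_red
    d_red = d_red_map.get(horizon, 4)
    spec = np.fft.rfft(seasonal)
    keep = np.argsort(np.abs(spec))[::-1][:d_red]     # dominant spectral bins
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]
    # Reconstruct the periodic part and tile it forward into the future window.
    periodic = np.fft.irfft(mask, n=L)
    periodic_fc = np.array([periodic[i % L] for i in range(L, L + horizon)])
    return trend_fc + periodic_fc
```

On a purely linear input, the spectral branch contributes almost nothing and the forecast simply continues the trend, which is the intended division of labor between the two branches.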
Student (LLM) Integration. Raw multivariate series X (L timesteps, C channels) are first embedded as tokens, then processed by a multi‑head self‑attention layer to capture inter‑channel dependencies (E₁). A cross‑attention step maps E₁ into the textual embedding space of the LLM, producing Z₁, which serves as the LLM’s input.
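The embedding pipeline above can be sketched in a few lines of numpy. All dimensions, weight initializations, and the stand‑in word‑embedding table are illustrative assumptions; the point is only the data flow X → E₁ → Z₁.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
L, C, d, d_llm = 96, 7, 16, 32          # lookback, channels, model dim, LLM dim (hypothetical)
X = rng.normal(size=(C, L))             # channel-as-token view of the multivariate series

# 1) Embed each channel's series into a d-dimensional token.
W_embed = rng.normal(size=(L, d)) * 0.1
E0 = X @ W_embed                        # (C, d): one token per channel

# 2) Self-attention over tokens captures inter-channel dependencies -> E1.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
E1 = attention(E0 @ Wq, E0 @ Wk, E0 @ Wv)        # (C, d)

# 3) Cross-attention maps E1 into the LLM's textual embedding space -> Z1.
V_text = rng.normal(size=(100, d_llm))  # stand-in for the LLM's word embeddings
W_cross_q = rng.normal(size=(d, d_llm)) * 0.1
Z1 = attention(E1 @ W_cross_q, V_text, V_text)   # (C, d_llm), fed to the LLM
```

The cross‑attention step is what lets numeric tokens land in a space the frozen LLM already understands, since Z₁ is a convex combination of textual embedding vectors.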
During training, two loss components are jointly optimized:
- Representation‑level loss aligns the intermediate representations (E₁ vs. Z₁) to encourage the LLM to internalize the teacher’s feature space.
- Prediction‑level loss combines mean‑squared error and KL‑divergence between the teacher’s forecast Ŷ_Teacher and the LLM’s own prediction Ŷ_Student. This forces the LLM to mimic the teacher’s probabilistic output distribution, effectively transferring forecasting behavior rather than merely providing auxiliary cues.
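The two loss components above can be folded into a single training objective. A minimal numpy sketch, with illustrative weighting coefficients and a temperature‑softened softmax standing in for the probabilistic output distributions (the paper's exact parameterization may differ):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(y_student, y_teacher, e1, z1, tau=1.0, alpha=0.5, beta=0.1):
    """Sketch of the joint objective: MSE + KL on the forecasts, plus an
    L2 representation-alignment term. alpha, beta, tau are illustrative."""
    # Prediction-level: match the teacher's point forecast ...
    mse = np.mean((y_student - y_teacher) ** 2)
    # ... and its (temperature-softened) output distribution.
    p_t = softmax(y_teacher / tau)
    p_s = softmax(y_student / tau)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    # Representation-level: align intermediate features E1 and Z1.
    rep = np.mean((e1 - z1) ** 2)
    return mse + alpha * kl + beta * rep
```

By construction the loss vanishes when the student reproduces the teacher exactly, which is what "transferring forecasting behavior" means operationally here.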
Inference. After convergence, the teacher branch is discarded. The LLM alone receives the same pre‑processed token sequence and directly generates forecasts. Because the teacher’s parameters are absent at inference, the model’s size and latency are identical to a vanilla LLM, yet the LLM now possesses an internalized sense of trend and periodicity.
Empirical Evaluation. Experiments cover standard benchmarks (ETT, Electricity, Traffic, Weather) and real‑world infectious‑disease datasets (influenza, COVID‑19). The authors evaluate three regimes: full‑shot (all training data), few‑shot (10–20 % of data), and zero‑shot (no task‑specific training). T‑LLM consistently outperforms baselines such as CALF, TimeLLM, and the foundation model Time‑MoE, achieving 5–12 % lower MAE/RMSE across settings. Notably, in zero‑shot scenarios the LLM alone matches or exceeds the performance of models that rely on massive pre‑training, demonstrating that the distilled temporal knowledge is robust and transferable.
Strengths and Contributions.
- Introduces a reverse‑distillation paradigm that treats forecasting as a teachable skill rather than an emergent property of scale.
- Combines trend‑based linear modeling with adaptive spectral analysis, providing structured supervision that covers both long‑term drift and periodic patterns.
- Employs horizon‑conditioned spectral capacity, eliminating the need for per‑task hyper‑parameter tuning.
- Removes the teacher at inference, yielding a deployment‑ready LLM with no extra runtime cost.
Limitations and Future Work. The teacher’s architecture is handcrafted for univariate or multivariate numeric series; extending it to multimodal inputs (e.g., images + time series) or to online continual learning scenarios will require additional design. Moreover, while the paper demonstrates strong results on several domains, the robustness of the distilled knowledge under distribution shift (e.g., sudden regime changes) remains an open question.
Conclusion. T‑LLM provides a practical and theoretically appealing solution to equip large language models with genuine time‑series forecasting ability without the need for massive time‑series pre‑training. By distilling structured temporal patterns from a lightweight teacher, the approach yields a single LLM that can operate effectively in full‑shot, few‑shot, and zero‑shot settings, opening new avenues for integrating LLMs into real‑time decision‑making pipelines.