Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.


💡 Research Summary

Background and Motivation
Large language model (LLM) pre‑training is increasingly performed in continual or open‑ended settings where the total number of training steps cannot be known in advance. The de facto standard learning‑rate schedule, cosine decay, is horizon‑dependent: it requires a predefined training length to set its amplitude and decay rate. Consequently, cosine schedules must be re‑tuned for each compute budget, which is impractical for streaming data or continual‑learning pipelines. Recent work introduced the warm‑up‑stable‑decay (WSD) schedule as an “anytime” alternative, but it still relies on a few horizon‑dependent hyper‑parameters.

Theoretical Contributions
The authors study over‑parameterized linear regression (a quadratic loss) as a proxy for deep networks. Assuming a power‑law spectrum of the data covariance characterized by source and capacity exponents (β, γ), they prove that stochastic gradient descent (SGD) with a polynomially decaying learning rate ηₜ ∝ t^(−a) for some 0 < a < 1, combined with tail‑averaging (or an exponential moving average, EMA), attains the minimax‑optimal risk. The decay exponent a is determined by the spectral properties of the problem; in practice a = ½ (i.e., ηₜ ∝ 1/√t) works well. They also show that a constant learning rate together with weight averaging achieves the same rate, provided the averaging window is chosen appropriately; crucially, neither scheme depends on the final horizon. Both schedules are therefore “anytime”: they can be deployed without knowing the total number of steps and still match the performance of a horizon‑aware, optimally tuned cosine schedule.
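The theoretical setting above can be illustrated with a toy experiment. The sketch below is not the paper's code: the dimension, power‑law exponent, step size η₀, and noise level are all arbitrary illustrative choices, and the tail average stands in for the paper's weight averaging. It runs SGD with a horizon‑free ηₜ = η₀/√t schedule on a linear‑regression problem whose covariance has a power‑law spectrum, and averages the second half of the iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 4000
# Power-law covariance spectrum, an illustrative stand-in for the paper's
# source/capacity conditions (the exponent -2.0 is an arbitrary choice).
eigs = np.arange(1, d + 1, dtype=float) ** -2.0
w_star = rng.normal(size=d) * np.sqrt(eigs)   # ground-truth weights

def risk(w):
    """Excess quadratic risk of weights w under the chosen spectrum."""
    return float(np.sum(eigs * (w - w_star) ** 2))

w = np.zeros(d)
tail = np.zeros(d)      # running average of the second half of the iterates
tail_count = 0
eta0 = 0.1              # base step size (illustrative)
for t in range(1, n + 1):
    x = rng.normal(size=d) * np.sqrt(eigs)    # sample with power-law spectrum
    y = x @ w_star + 0.1 * rng.normal()       # noisy label
    grad = (x @ w - y) * x                    # squared-loss stochastic gradient
    w -= eta0 / np.sqrt(t) * grad             # horizon-free 1/sqrt(t) step
    if t > n // 2:                            # tail-average the late iterates
        tail_count += 1
        tail += (w - tail) / tail_count

# Averaging typically smooths out the noise floor of the last iterate.
print(risk(w), risk(tail))
```

Note that neither the step-size rule nor the averaging window references the total horizon n inside the update itself; the loop could be stopped at any point and the tail average read out.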

Experimental Setup

  • Models: Two Transformer‑style LLMs (150 M and 300 M parameters) built on the OLMo codebase.
  • Data: C4 dataset, tokenized with T5 tokenizer, no repetition, fully online streaming.
  • Compute Budgets: Training runs span powers of two multiples of the Chinchilla compute budget: 1×, 2×, 4×, 8×, 16×, 32× for the 150 M model and up to 16× for the 300 M model.
  • Optimiser: AdamW (β₁ = 0, ε = 1e‑8) with a sweep over learning rates (1e‑4 – 1e‑2) and β₂ ∈ {0.95, 0.98, 0.99}.
  • Schedules Compared: (i) constant η with EMA, (ii) ηₜ = p · √(α/(t + α)) with EMA, which behaves like 1/√t for large t, (iii) WSD (linear warm‑up, constant phase up to 90 % of the run, linear decay over the final 10 %), and (iv) cosine decay tuned separately for each compute budget (the standard baseline).
  • EMA Details: Multiple EMAs are maintained simultaneously with half‑life fractions f ∈ {0, 6.25, 12.5, 25, 50, 100} (in percent), where f = 0 corresponds to using the last iterate only.
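The horizon‑free schedules and the EMA bookkeeping from the setup above can be sketched as follows. This is an assumed realization, not the OLMo-based training code: the base learning rate and offset α are placeholder values, and interpreting the half‑life fraction f as a fraction of elapsed steps (so the averaging window grows with training) is my reading of the summary, not something it states explicitly.

```python
import math

def lr_constant(t, base=3e-4):
    # Schedule (i): constant learning rate, no horizon anywhere.
    return base

def lr_inv_sqrt(t, base=3e-4, alpha=100.0):
    # Schedule (ii): p * sqrt(alpha / (t + alpha)); the offset alpha keeps
    # early steps bounded, and the rule decays like 1/sqrt(t) for large t.
    return base * math.sqrt(alpha / (t + alpha))

def ema_update(ema, params, t, half_life_frac):
    """Update one EMA whose half-life is a fixed fraction of elapsed steps.

    half_life_frac = 0 means 'use the last iterate only'. The decay rho is
    recomputed each step so the half-life tracks half_life_frac * t
    (an assumption about how the paper's f parameter is realized).
    """
    if half_life_frac == 0:
        return list(params)
    half_life = max(1.0, half_life_frac * t)
    rho = 0.5 ** (1.0 / half_life)
    return [rho * e + (1.0 - rho) * p for e, p in zip(ema, params)]
```

Maintaining several EMAs in parallel is then just a dict of parameter copies keyed by f, which matches the summary's note that the overhead is one extra copy of the model per EMA.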

Key Empirical Findings

  1. Loss Trajectories: Across all intermediate checkpoints, the three “anytime” schedules (constant + EMA, 1/√t + EMA, and WSD + EMA) closely track the “cosine envelope” – the best loss achievable by a horizon‑specific cosine schedule at each checkpoint. The maximum deviation is less than 0.1 % of validation loss.
  2. Performance Relative to Cosine: In the “loss improvement over cosine” plots, the anytime methods are essentially flat around zero, sometimes slightly better (negative values) in the later stages, indicating no systematic degradation.
  3. Hyper‑parameter Efficiency: The cosine baseline requires a fresh hyper‑parameter sweep for every compute multiple, whereas the anytime methods are trained once at the largest horizon (32× or 16×) and reused, dramatically reducing tuning cost.
  4. Memory Overhead: Maintaining several EMAs adds only one extra copy of the model parameters per EMA, a negligible overhead compared to the overall model size.
  5. Stability: The 1/√t schedule shows a modest dip in early training (higher loss) but EMA quickly stabilizes it. The constant‑η schedule starts with a slightly higher loss but converges to the same asymptotic performance as the other methods.

Interpretation and Implications
The experiments validate the theoretical claim that horizon‑free learning‑rate schedules, when paired with weight averaging, can achieve the same statistical efficiency as an optimally tuned cosine decay. This is significant for real‑world LLM training pipelines where data streams continuously and the total compute budget cannot be predetermined. Practitioners can adopt a simple polynomial decay (or even a constant learning rate) together with EMA, eliminate horizon‑specific tuning, and still match the pre‑training quality of a well‑tuned cosine baseline.
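The practical recipe described above fits in a few lines. The loop below is a minimal caricature, assuming a scalar parameter, a noisy quadratic objective, and placeholder constants (η, ρ, the target, the step count); the point is only that nothing in the update rule ever consults a total horizon, so training can stop, checkpoint the EMA, and resume at any time.

```python
import random

random.seed(1)
w, ema = 0.0, 0.0
eta, rho = 0.05, 0.999        # constant LR and EMA decay (assumed values)
target = 3.0                  # minimizer of the toy quadratic

for t in range(1, 20001):
    g = (w - target) + random.gauss(0.0, 1.0)  # noisy gradient of quadratic
    w -= eta * g                               # constant-LR SGD step
    ema = rho * ema + (1.0 - rho) * w          # horizon-free weight average

# The EMA can be read out at any step; no terminal decay phase is needed.
print(abs(w - target), abs(ema - target))
```

With a constant step size the raw iterate keeps fluctuating around the optimum, while the EMA averages those fluctuations away, which mirrors the role weight averaging plays in the paper's anytime schedules.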

Limitations and Future Work

  • The theoretical analysis is confined to quadratic loss landscapes (linear regression). Extending the minimax guarantees to deep, non‑convex networks remains an open challenge.
  • The study focuses on validation loss; downstream task performance (e.g., zero‑shot prompting) is not reported.
  • Automatic selection of EMA half‑life (the f parameter) could be explored via meta‑learning or adaptive schemes.
  • Combining the anytime schedule with other optimizer innovations (e.g., schedule‑free optimizers, adaptive momentum) may further improve robustness.

Conclusion
“Anytime pre‑training” demonstrates that simple, horizon‑free step‑size schedules—specifically polynomial decay (1/√t) or constant learning rates—combined with weight averaging, match the performance of the widely used cosine decay across a wide range of compute budgets. This provides a practical, low‑overhead alternative for large‑scale language model pre‑training, especially in continual‑learning or streaming data scenarios where the training horizon is unknown.

