Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup


💡 Research Summary

This paper tackles the long‑standing practical question of why learning‑rate warm‑up is essential for modern norm‑constrained optimizers such as Muon and Lion, and how to automate its duration. The authors first formalize the update rule of Linear‑Minimization‑Oracle (LMO) based methods, where the step direction is the solution of a linear problem over a unit ball defined by an arbitrary norm. Traditional convergence analyses rely on a uniform L‑smoothness assumption, which cannot explain the empirical need for a warm‑up phase.
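To make the LMO update concrete: over the unit ℓ∞ ball the LMO recovers the sign direction used by Lion, and over the unit spectral-norm ball it yields the orthogonalized direction behind Muon. The sketch below is our illustration, not the paper's code (function names are ours):

```python
import numpy as np

def lmo_linf(g):
    """LMO over the unit ℓ∞ ball: argmin over ‖u‖∞ ≤ 1 of ⟨g, u⟩ = -sign(g)
    (the Lion/signSGD direction)."""
    return -np.sign(g)

def lmo_spectral(G):
    """LMO over the unit spectral-norm ball (Muon-style): -U Vᵀ from the SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -(U @ Vt)
```

In both cases ⟨g, LMO(g)⟩ = −‖g‖_* for the corresponding dual norm, which is why the dual norm appears in the smoothness assumption below.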

To bridge this gap, the authors introduce a generalized smoothness assumption: the Lipschitz constant of the gradient is not fixed but depends on the sub‑optimality gap Δ = f(x) − f★. Specifically, they define K(x) = K₀ + K₁·Δ + K_ρ·Δ^ρ and require

‖∇f(x) − ∇f(y)‖_* ≤ K(x)·‖x − y‖,

where ‖·‖_* is the dual norm of the norm used in the LMO. This (ρ, K₀, K₁, K_ρ)‑smoothness captures the empirical observation that curvature diminishes as the optimizer approaches a low‑loss region.
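The assumed smoothness constant is just a polynomial in the gap; a minimal sketch (function name ours):

```python
def smoothness_constant(delta, K0, K1, K_rho, rho):
    """K(x) = K0 + K1·Δ + K_rho·Δ^ρ, where Δ = f(x) − f★ is the suboptimality gap."""
    return K0 + K1 * delta + K_rho * delta ** rho
```

As Δ → 0 the constant collapses to K₀, so the effective curvature near a low-loss region is much smaller than early in training.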

The authors validate this assumption empirically on large‑scale LLaMA pre‑training runs. For Lion, Muon, and normSGD they plot the “smoothness ratio” Kₜ = ‖∇f(xₜ₊₁) − ∇f(xₜ)‖_* / ‖xₜ₊₁ − xₜ‖ against Δₜ and find a clear quadratic relationship, confirming that K_ρ > 0 is necessary to model the whole training trajectory (Δ shrinks by orders of magnitude).
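A sketch of how such ratios could be computed from a logged trajectory (function name and Euclidean norms are our simplification; the paper pairs the dual norm with the primal norm of the chosen LMO):

```python
import numpy as np

def smoothness_ratios(xs, grads, losses, f_star):
    """Pair each gap Δ_t = f(x_t) − f★ with the empirical ratio
    K_t = ‖∇f(x_{t+1}) − ∇f(x_t)‖ / ‖x_{t+1} − x_t‖ along a logged trajectory."""
    deltas, Ks = [], []
    for t in range(len(xs) - 1):
        step = np.linalg.norm(xs[t + 1] - xs[t])
        if step > 0:
            deltas.append(losses[t] - f_star)
            Ks.append(np.linalg.norm(grads[t + 1] - grads[t]) / step)
    return np.array(deltas), np.array(Ks)

# Fitting K ≈ K0 + K1·Δ + K_rho·Δ² (i.e. ρ = 2) then reduces to least squares, e.g.:
# K_rho, K1, K0 = np.polyfit(deltas, Ks, 2)
```

On a quadratic f the ratio is exactly the (constant) curvature, which is a quick sanity check for such a logger.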

Under this assumption, Theorem 1 shows that choosing the learning rate as

ηₜ = Δₜ / (D · K(xₜ)) (with D bounding the distance to the optimum)

yields a monotone decrease of the sub‑optimality gap and a learning‑rate schedule that automatically exhibits a warm‑up phase followed by decay: the schedule rises while Δₜ > Δ′ = (K₀ / (K_ρ(ρ − 1)))^{1/ρ} and falls afterwards. The convergence bound is O(1/T), matching classic rates but with a tighter constant during early iterations, because the adaptive K(xₜ) permits larger ηₜ.
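Reading the grouping as ηₜ = Δₜ / (D·K(xₜ)), the warm-up-then-decay shape can be reproduced numerically. The constants and the geometric loss curve below are made up for illustration:

```python
import numpy as np

def eta(delta, D=1.0, K0=1.0, K1=0.0, K_rho=1.0, rho=2.0):
    """Theorem-1-style step size η = Δ / (D · K(Δ)); constants are illustrative."""
    return delta / (D * (K0 + K1 * delta + K_rho * delta ** rho))

# A geometrically shrinking gap stands in for a real loss curve.
deltas = 10.0 * 0.9 ** np.arange(200)
etas = eta(deltas)
peak = int(np.argmax(etas))
# With these constants, η rises while Δ > Δ' = (K0/(K_rho·(ρ−1)))^{1/ρ} = 1,
# peaks as Δ crosses 1, and decays afterwards.
```

The peak of the simulated schedule lands where the gap crosses the theoretical threshold Δ′, i.e. warm-up length falls out of the loss trajectory rather than being tuned.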

Theorem 2 extends the analysis to include weight decay (xₜ₊₁ = (1 − ληₜ)xₜ + ηₜ·LMO(gₜ)). In this setting boundedness of iterates is no longer required; the decay term itself keeps the iterates in a compact set. The same adaptive ηₜ (scaled by λ) again yields O(1/T) convergence.
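The boundedness argument can be seen directly from the update: with a unit-norm LMO direction, ‖x⁺‖ ≤ (1 − λη)‖x‖ + η, so iterates stay inside max(‖x₀‖, 1/λ). A toy check under that assumption (ℓ∞ ball, fixed η, names ours):

```python
import numpy as np

def decayed_step(x, direction, eta, lam):
    """Theorem-2 update x_{t+1} = (1 − λ η_t) x_t + η_t · LMO(g_t),
    where `direction` plays the role of the unit-norm LMO output."""
    return (1.0 - lam * eta) * x + eta * direction

# Even with adversarial unit-ℓ∞ directions, the decay term confines the iterates.
rng = np.random.default_rng(0)
x = np.full(4, 2.0)
for _ in range(1000):
    d = rng.choice([-1.0, 1.0], size=4)   # worst-case ±1 entries
    x = decayed_step(x, d, eta=0.1, lam=0.1)
bound = max(2.0, 1.0 / 0.1)               # max(‖x₀‖∞, 1/λ)
```

This is why no separate bounded-iterate assumption is needed once weight decay is present.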

Motivated by these results, the authors propose a practical adaptive scheduler that only needs standard hyper‑parameters (initial max learning rate, min learning rate, maximum warm‑up steps). At each step it estimates Δₜ from the current loss, computes K(xₜ) using the fitted constants K₀, K₁, K_ρ (obtained from a short pilot run), and sets ηₜ accordingly. The warm‑up ends when ηₜ reaches its peak, which is detected automatically, eliminating any manual tuning of the warm‑up length.
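A plausible sketch of such a scheduler (class name, defaults, and the peak-detection rule are our assumptions; the fitted constants K₀, K₁, K_ρ would come from the pilot run):

```python
class AdaptiveWarmupScheduler:
    """Set η_t = clip(Δ_t / (D·K(Δ_t)), min_lr, max_lr) from the running loss;
    warm-up ends automatically once η_t stops increasing (or a step cap is hit)."""

    def __init__(self, f_star, D, K0, K1, K_rho, rho,
                 min_lr, max_lr, max_warmup_steps):
        self.f_star, self.D = f_star, D
        self.K0, self.K1, self.K_rho, self.rho = K0, K1, K_rho, rho
        self.min_lr, self.max_lr = min_lr, max_lr
        self.max_warmup_steps = max_warmup_steps
        self.t, self.prev_lr, self.in_warmup = 0, 0.0, True

    def step(self, loss):
        delta = max(loss - self.f_star, 0.0)            # estimated gap Δ_t
        K = self.K0 + self.K1 * delta + self.K_rho * delta ** self.rho
        lr = min(max(delta / (self.D * K), self.min_lr), self.max_lr)
        if self.in_warmup and (lr < self.prev_lr or self.t >= self.max_warmup_steps):
            self.in_warmup = False                      # peak detected: warm-up over
        self.prev_lr, self.t = lr, self.t + 1
        return lr
```

Fed a decreasing loss curve, the emitted learning rate rises, peaks, and decays without any preset warm-up length.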

The scheduler is evaluated on LLaMA‑7B and LLaMA‑13B models trained with Muon, Lion, and normSGD. Baselines include (i) a constant learning rate, (ii) the widely used linear warm‑up + cosine decay, and (iii) hand‑tuned warm‑up schedules. Across all configurations the adaptive scheduler matches or surpasses the best hand‑tuned baseline in final perplexity and token‑level accuracy, while achieving faster loss reduction in the early training phase (≈15‑25 % quicker in the first 10 % of steps). Ablation studies show that the method is robust to reasonable misspecification of f★ (up to ±10 % error) and that the quadratic term K_ρ·Δ^ρ is crucial for the warm‑up effect.

The paper also discusses limitations: the need for an estimate of the target loss f★, the deterministic nature of the theoretical analysis (real training uses stochastic gradients), and potential instability on highly non‑normalized losses. Nonetheless, the work provides a theoretically grounded explanation for warm‑up, demonstrates that it naturally emerges from a sub‑optimality‑dependent smoothness model, and delivers a plug‑and‑play adaptive scheduler that removes a major source of manual hyper‑parameter tuning in large‑scale language model training.

