A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$) – the largest eigenvalue of the loss Hessian – determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to its high computational cost. We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta\boldsymbol{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and the Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and to guide data-mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data-composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.


💡 Research Summary

The paper introduces a scalable curvature metric called critical sharpness (λ_c) for large language model (LLM) training, addressing the prohibitive cost of directly computing Hessian sharpness (the largest eigenvalue of the loss Hessian). Critical sharpness is defined as λ_c = 2 / η_c, where η_c is the critical learning rate: the smallest step size along the current update direction Δθ that causes the loss to increase. To estimate η_c, the authors employ a two‑phase line search—first an exponential expansion to bracket the region where loss starts to rise, then a binary refinement. This procedure requires only forward passes (no backward Hessian‑vector products) and typically converges in 5–6 evaluations, making it compatible with modern distributed training pipelines.
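The two-phase search can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation; `loss_at` is a hypothetical helper that returns the training loss after a trial step of size η along the current update direction Δθ (with `loss_at(0.0)` the current loss).

```python
def critical_lr(loss_at, eta0=1e-4, growth=2.0, n_refine=4):
    """Estimate the critical learning rate eta_c along the current update
    direction using only forward-pass loss evaluations.

    loss_at(eta): loss after a trial step of size eta along the update
    direction. Returns (eta_c estimate, critical sharpness 2 / eta_c).
    """
    base = loss_at(0.0)

    # Phase 1: exponential expansion until the loss first rises above
    # the current loss, bracketing eta_c in [lo, hi].
    lo, hi = 0.0, eta0
    while loss_at(hi) <= base:
        lo, hi = hi, hi * growth

    # Phase 2: binary refinement of the bracket.
    for _ in range(n_refine):
        mid = 0.5 * (lo + hi)
        if loss_at(mid) <= base:
            lo = mid
        else:
            hi = mid

    eta_c = 0.5 * (lo + hi)
    return eta_c, 2.0 / eta_c
```

The number of evaluations depends on the choices of `eta0`, `growth`, and `n_refine`; on a 1-D quadratic with curvature λ, where the loss after a gradient step of size η is proportional to (1 − ηλ)², the search recovers η_c ≈ 2/λ.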

Theoretical analysis shows that under a quadratic approximation of the loss along Δθ, the loss increases when η > 2 / λ_dir, where λ_dir = ΔθᵀHΔθ / (gᵀΔθ) is the directional sharpness. For plain gradient descent (Δθ ∝ g), λ_dir can be expressed as a weighted average of Hessian eigenvalues, with weights given by the squared projections of the gradient onto the corresponding eigenvectors. Consequently, λ_dir ≤ λ_max^H, with equality when the gradient aligns perfectly with the top eigenvector. For adaptive optimizers (e.g., Adam), an analogous relationship holds using the pre‑conditioned Hessian and gradient. Thus, critical sharpness approximates directional sharpness, which in turn is bounded by Hessian sharpness; the three measures coincide when the loss surface is locally quadratic and the gradient points along the direction of steepest curvature.
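These relationships are easy to verify numerically on a synthetic quadratic. The sketch below (our illustration, not the paper's code) checks the Rayleigh-quotient form of λ_dir for the gradient-descent case Δθ = g, its equivalent spectral form, and the equality case at the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Random symmetric Hessian with a known eigendecomposition.
A = rng.standard_normal((d, d))
H = A @ A.T
eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
lam_max = eigvals[-1]

# Directional sharpness for plain gradient descent (update direction = g).
g = rng.standard_normal(d)
lam_dir = (g @ H @ g) / (g @ g)

# Equivalent spectral form: a weighted average of eigenvalues, with
# weights equal to squared projections of g onto the eigenvectors,
# which immediately gives lam_dir <= lam_max.
w = (eigvecs.T @ g) ** 2
lam_dir_spectral = (w @ eigvals) / w.sum()

# Equality is attained when g aligns with the top eigenvector.
g_top = eigvecs[:, -1]
lam_top = (g_top @ H @ g_top) / (g_top @ g_top)
```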

Empirically, the authors first validate these relationships on CIFAR‑10 MLPs trained with SGD across various batch sizes. Critical sharpness tracks directional sharpness closely and both exhibit the well‑known phenomena of progressive sharpening (steady increase of curvature early in training) and the Edge of Stability (EoS) (curvature reaching the threshold 2 / η and then oscillating). Hessian sharpness shows similar trends but with larger early spikes and more pronounced oscillations after EoS.

The core contribution is scaling this analysis to LLMs. Using checkpoints from the OLMo‑2 family (0.3 B to 7 B parameters), the authors measure λ_c throughout pre‑training and mid‑training. They observe that progressive sharpening persists at these scales, and that λ_c reliably hits the EoS threshold under constant learning rates. When learning‑rate schedules (warm‑up and decay) are applied, λ_c follows the schedule, confirming its sensitivity to optimizer dynamics. Importantly, these observations are obtained with orders of magnitude less computation than required for Hessian eigenvalue estimation.

A novel extension is relative critical sharpness (λ_c^{1→2}), which quantifies the curvature of a source loss (e.g., pre‑training) while the model is being optimized on a target loss (e.g., fine‑tuning). By varying the mix ratio of pre‑training and fine‑tuning data, the authors show that higher proportions of pre‑training data lower λ_c^{1→2}, keeping the model within the “pre‑trained basin.” Downstream experiments reveal task‑dependent effects: math‑oriented GSM8K benefits when the model leaves the pre‑trained basin (η > 2 / λ_c^{1→2}), whereas general reasoning tasks like MMLU perform better when the basin is retained (η < 2 / λ_c^{1→2}). This insight enables practitioners to design data‑mixing strategies that deliberately steer curvature dynamics to favor specific downstream objectives.
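As a toy illustration of the definition (our construction, not the paper's experimental setup), one can take two quadratics standing in for the source and target losses, step along the target-loss gradient, and find the smallest step size at which the source loss rises; under the quadratic model this matches the closed form λ_c^{1→2} = ΔθᵀH_sΔθ / (g_sᵀΔθ).

```python
import numpy as np

# Toy 2-D quadratics standing in for the two losses.
H_src = np.diag([4.0, 1.0])        # "pre-training" Hessian, minimum at 0
m_tgt = np.array([-1.0, 0.0])      # "fine-tuning" minimum
theta = np.array([1.0, 1.0])       # current parameters

src = lambda th: 0.5 * th @ H_src @ th
g_src = H_src @ theta              # source-loss gradient
delta = theta - m_tgt              # target-loss gradient (identity target Hessian)

# Scan step sizes along -delta and find the first eta at which
# the SOURCE loss rises above its current value.
base = src(theta)
eta_c = None
for eta in np.linspace(1e-4, 4.0, 40001):
    if src(theta - eta * delta) > base:
        eta_c = eta
        break

lam_rel = 2.0 / eta_c                                       # scanned value
lam_rel_closed = (delta @ H_src @ delta) / (g_src @ delta)  # quadratic formula
```

Steps with η below 2 / λ_c^{1→2} leave the source loss no higher than its starting value, which is the "staying in the pre-trained basin" condition described above.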

In summary, the paper delivers a practical, low‑cost curvature diagnostic that (1) faithfully reproduces classic Hessian‑based phenomena, (2) scales to billions of parameters, (3) bridges pre‑training and fine‑tuning dynamics via relative sharpness, and (4) offers actionable guidance for learning‑rate scheduling and data composition. The work opens avenues for automated curvature‑aware training algorithms, adaptive curriculum design, and broader investigations of curvature’s role in generalization across diverse model families.
