Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. A key challenge in these methods is selecting suitable subspaces to ensure an effective optimization trajectory. Most existing approaches select the dominant subspace to preserve gradient information, as this intuitively provides the best approximation. However, we find that in practice, the dominant subspace stops changing during pretraining, thereby constraining weight updates to similar subspaces. In this paper, we propose importance sampling for low-rank optimization in LLM pretraining with a provable convergence guarantee, which the dominant subspace approach does not have. Empirically, we demonstrate that our method significantly outperforms previous methods in LLM pretraining tasks.


💡 Research Summary

This paper, “Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining,” addresses a critical limitation in memory-efficient training techniques for Large Language Models (LLMs) and proposes a novel solution.

The core challenge in LLM pretraining is the massive memory footprint of optimizer states, particularly in Adam, which stores two state matrices each as large as the model parameters. Low-rank optimization methods, such as GaLore and Fira, tackle this by projecting the high-dimensional gradients onto a low-rank subspace, allowing optimizer states to be stored in this compressed space. The key to these methods’ effectiveness is the selection of this low-rank subspace. The prevailing approach is to choose the “dominant subspace” spanned by the singular vectors corresponding to the largest singular values of the gradient matrix, as this best preserves the gradient information.
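The projection idea above can be illustrated with a minimal sketch. This is not the authors' code: the function name and shapes are assumptions, and it shows only the core step of keeping optimizer states at the compressed shape.

```python
import numpy as np

def project_gradient(grad, rank):
    """Project a gradient matrix onto its dominant rank-r subspace
    (GaLore-style; a simplified sketch, not the paper's implementation)."""
    # SVD of the gradient: grad = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]              # top-r left singular vectors (dominant subspace)
    low_rank_grad = P.T @ grad   # shape (r, n): optimizer states live at this size
    return P, low_rank_grad

rng = np.random.default_rng(0)
grad = rng.standard_normal((256, 128))
P, g_lr = project_gradient(grad, rank=8)
# Adam moments would be stored at shape (8, 128) instead of (256, 128);
# the low-rank update is mapped back to full size via P @ update.
```

The memory saving comes from the moment estimates being kept at the (r, n) compressed shape rather than the full (m, n) parameter shape.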

However, the authors identify a fundamental flaw in this dominant subspace approach: the “Frozen Subspace” phenomenon. Through experiments on LLaMA model pretraining, they show that the dominant subspace of gradients in many layers becomes highly stable and stops evolving after the initial training stages. This leads to a “low-rank bottleneck,” where weight updates are repeatedly constrained to very similar directions. Consequently, the cumulative weight update matrix remains effectively low-rank, severely limiting the model’s representational capacity and final performance.
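The "frozen subspace" diagnosis amounts to measuring how much the projection basis changes between training steps. A common way to quantify this, sketched below as a hypothetical diagnostic (not taken from the paper's code), is the normalized Frobenius overlap between consecutive orthonormal bases.

```python
import numpy as np

def subspace_overlap(P_prev, P_curr):
    """Overlap between two orthonormal bases with r columns each:
    ||P_prev^T P_curr||_F^2 / r, which is 1.0 for identical subspaces
    and 0.0 for orthogonal ones. A hypothetical diagnostic sketch."""
    r = P_prev.shape[1]
    return np.linalg.norm(P_prev.T @ P_curr, "fro") ** 2 / r

# Identical bases give overlap 1.0; a frozen subspace would show
# overlaps near 1.0 across many consecutive steps.
P = np.linalg.qr(np.random.default_rng(0).standard_normal((64, 8)))[0]
same = subspace_overlap(P, P)
```

Values persistently near 1.0 across steps would indicate the stagnation the authors describe, with weight updates confined to nearly the same directions.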

To overcome this, the paper proposes SARA (Importance SAmpling for Low-RAnk Optimization). Instead of always selecting the top singular vectors, SARA constructs the low-rank subspace by performing importance sampling without replacement from the set of left singular vectors. The probability of selecting each vector is proportional to its corresponding singular value. This strategy maintains a preference for the most influential directions (large singular values) but introduces controlled randomness, allowing less dominant directions to be explored with a non-zero probability. This simple change significantly reduces the overlap between subspaces used in adjacent training steps, fostering diversity in the optimization trajectory.
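The selection rule described above can be sketched as follows. This is a minimal illustration of sampling left singular vectors without replacement with probabilities proportional to singular values; the function name and the exact normalization are assumptions, not the authors' implementation.

```python
import numpy as np

def sara_subspace(grad, rank, rng):
    """Select a rank-r projection by importance sampling left singular
    vectors without replacement, p_i proportional to sigma_i
    (a sketch of SARA's selection rule; details are assumptions)."""
    U, S, _ = np.linalg.svd(grad, full_matrices=False)
    probs = S / S.sum()  # sampling probability proportional to singular value
    idx = rng.choice(len(S), size=rank, replace=False, p=probs)
    return U[:, idx]     # sampled orthonormal basis replaces the top-r choice

rng = np.random.default_rng(1)
grad = rng.standard_normal((64, 32))
P = sara_subspace(grad, rank=4, rng=rng)
```

Because large singular values still dominate the sampling distribution, influential directions are usually kept, while the randomness lets the selected subspace differ between adjacent steps.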

SARA offers several key advantages. First, it is a plug-and-play module that can be seamlessly integrated into existing low-rank optimizers like GaLore-Adam or Fira-Adam by simply replacing their subspace selection routine. Second, it comes with a provable convergence guarantee, which the dominant subspace method lacks. The authors provide theoretical analysis (Lemma 3.3, Theorem 3.4) showing that the projection error of SARA is bounded and that low-rank momentum SGD with SARA converges at a rate comparable to prior provable methods like GoLore. Third, the computational overhead is negligible, adding only a small sampling step after the SVD computation.

Empirical results on pretraining LLaMA models of various scales (60M, 130M, 1B parameters) demonstrate SARA’s effectiveness. SARA-enhanced versions of GaLore and Fira consistently achieve lower validation perplexity than their dominant-subspace counterparts and the random-projection-based GoLore. Notably, for the LLaMA-1B model, SARA-GaLore reduced the performance gap with full-rank Adam by up to 46.05%. The method also remains compatible with other memory-saving techniques like Adafactor or Adam-mini.

In summary, this work diagnoses a critical performance bottleneck in low-rank optimization for LLMs—the frozen dominant subspace—and provides an elegant, theoretically sound, and empirically superior solution via importance sampling. SARA significantly advances the practicality of memory-efficient training without compromising model quality.

