Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching
Training efficiency in large-scale models is typically assessed through memory consumption, training time, and model performance. Current methods often exhibit trade-offs among these metrics, as optimizing one generally degrades at least one of the others. Addressing this trade-off remains a central challenge in algorithm design. While GaLore enables memory-efficient training by updating gradients in a low-rank subspace, it incurs substantial extra training time due to the periodic Singular Value Decomposition (SVD) of the gradients. In this paper, we propose Lotus, a method that resolves this trade-off by simply modifying the projection process. We propose a criterion that quantifies the displacement of the unit gradient to enable efficient transitions between low-rank gradient subspaces. Experimental results indicate that Lotus is the most efficient of the evaluated methods, achieving a 30% reduction in training time and a 40% decrease in memory consumption for gradient and optimizer states. Additionally, it outperforms the baseline method in both pre-training and fine-tuning tasks.
💡 Research Summary
The paper addresses the longstanding trade‑off in large language model (LLM) training among memory consumption, training time, and final model performance. While prior work such as GaLore reduces memory by projecting gradients onto a low‑rank subspace, it incurs a substantial time overhead because it recomputes the subspace at fixed intervals using a full singular value decomposition (SVD) on the gradient matrix. Lotus, the method proposed in this work, eliminates this trade‑off by (1) replacing the exact SVD with a power‑iteration‑based randomized SVD (rSVD) and (2) introducing an adaptive subspace‑switching criterion based on the displacement of unit‑norm gradients.
Randomized low‑rank projection.
Instead of performing an exact SVD on the full‑rank gradient at each update, Lotus computes an approximate low‑rank basis using rSVD. This technique requires only a few matrix‑vector multiplications, dramatically lowering both computational complexity (≈ O(m n log r) versus O(m n min(m,n))) and peak memory usage. Empirically, rSVD matches the performance of exact SVD at the same rank, confirming that the approximation does not degrade the quality of the projected gradient.
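The randomized low-rank basis computation described above can be sketched with a standard power-iteration rSVD (in the style of Halko et al.). This is a minimal NumPy illustration, not the paper's implementation; the function name and the oversampling/power-iteration defaults are assumptions.

```python
import numpy as np

def rsvd(G, rank, n_oversample=5, n_power_iter=2, seed=0):
    """Approximate rank-`rank` SVD of a gradient matrix G (m x n).

    Illustrative sketch of randomized SVD with power iterations;
    parameter defaults are assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    m, n = G.shape
    k = rank + n_oversample
    # Sketch the column space of G with a random Gaussian test matrix.
    Omega = rng.standard_normal((n, k))
    Y = G @ Omega
    # A few power iterations sharpen the spectrum; QR keeps the sketch stable.
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(G.T @ Y)
        Y = G @ Z
    Q, _ = np.linalg.qr(Y)  # orthonormal basis for an approximate range of G
    # Exact SVD of the small projected matrix B = Q^T G (k x n) is cheap.
    B = Q.T @ G
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :rank], S[:rank], Vt[:rank, :]
```

The returned `U` then serves as the projection basis: the low-rank gradient used by the optimizer is `U.T @ G`, which only requires storing an `m x r` matrix rather than recomputing a full SVD of `G`.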
Adaptive subspace switching.
The authors define a path‑efficiency ratio ρₜ = D_actual / D_ideal, where D_actual is the accumulated Euclidean displacement of the projected unit‑norm gradients over k steps, and D_ideal is the ideal displacement attained when all unit gradients are perfectly aligned. By construction ρₜ lies in [0, 1]: it equals 1 when the gradient directions are fully aligned and approaches 0 when they oscillate. Whenever ρₜ crosses a preset threshold, signalling that the current low‑rank subspace no longer tracks the gradient dynamics, Lotus switches to a fresh subspace computed with rSVD, instead of recomputing on a fixed schedule as GaLore does.
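One plausible reading of this displacement criterion is ρ = ‖Σᵢ uᵢ‖ / k for k unit gradients uᵢ, since the ideal displacement of perfectly aligned unit vectors is exactly k. The helper below is an illustrative sketch under that assumption; the paper's exact formula may differ.

```python
import numpy as np

def path_efficiency(unit_grads):
    """Path-efficiency ratio rho = D_actual / D_ideal for k unit gradients.

    Assumed reading: D_actual is the norm of the summed unit gradients
    (net displacement), and D_ideal = k, the displacement achieved when
    every unit gradient points the same way.
    """
    U = np.asarray(unit_grads, dtype=float)   # shape (k, d), rows unit-norm
    d_actual = np.linalg.norm(U.sum(axis=0))  # net displacement over k steps
    d_ideal = len(U)                          # upper bound, attained if aligned
    return d_actual / d_ideal
```

With this definition, identical gradient directions give ρ = 1, while directions that cancel each other drive ρ toward 0, which is what makes the ratio usable as a subspace-switching trigger.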