LoRDO: Distributed Low-Rank Optimization with Infrequent Communication
Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M–$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$'s advantage grows in very low-memory settings with small ranks and batch sizes.
💡 Research Summary
This paper introduces LoRDO, a novel framework designed to address the dual bottlenecks of communication overhead and optimizer state memory in distributed training of large foundation models. The core challenge is that while infrequent communication strategies reduce synchronization frequency, they remain hampered by the memory and communication demands of optimizer states. Conversely, low-rank optimizers (e.g., GaLore, LDAdam) can reduce these demands but suffer from performance degradation in local-update regimes due to noisy projections computed from small per-worker batches.
LoRDO provides a principled unification of low-rank optimization with infrequent synchronization. Its design is based on two key insights and corresponding mechanisms. First, the authors identify that using a global projection matrix, computed from the aggregated pseudo-gradient (the total parameter change over K local steps) across all workers, is superior to per-worker local projections. This global approach leverages a larger effective batch size (M * B), yielding a more stable low-rank subspace and ensuring all workers optimize within a unified basis. However, they theoretically and empirically demonstrate that this method alone leads to subspace stagnation, permanently restricting the optimization trajectory to a fixed rank-r subspace.
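The global-projection step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it averages per-worker pseudo-gradients (the total parameter change over the local steps) and takes the top-r left singular vectors as the shared projection basis, which is the standard way GaLore-style methods extract a rank-r subspace.

```python
import numpy as np

def global_projection(pseudo_grads, rank):
    """Compute a shared rank-r projection from aggregated per-worker
    pseudo-gradients (hypothetical sketch of the mechanism described
    in the summary, not the paper's actual implementation)."""
    # Average the pseudo-gradients across the M workers; since each
    # pseudo-gradient summarizes K local steps on batch size B, the
    # aggregate reflects an effective batch of M * B samples.
    delta = np.mean(pseudo_grads, axis=0)
    # The dominant left singular vectors of the aggregated
    # pseudo-gradient span the shared rank-r subspace in which all
    # workers subsequently optimize.
    U, _, _ = np.linalg.svd(delta, full_matrices=False)
    return U[:, :rank]  # orthonormal basis, shape (d_out, r)
```

Because every worker receives the same basis, their low-rank optimizer states remain directly comparable and can be averaged at synchronization points.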
To overcome this stagnation, LoRDO incorporates a full-rank quasi-hyperbolic momentum (QHM) term into the local update. Instead of applying the QHM coefficient solely within the low-rank space, LoRDO scales the original full-rank gradient by the inverse of the second momentum’s mean and adds it to the low-rank projected update. This injects a full-rank signal into the pseudo-gradient, thereby restoring the model’s ability to explore the entire parameter space while retaining the benefits of the stable global projection.
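A minimal sketch of such a local update is given below. All names, shapes, and coefficients are assumptions for illustration; in particular, the exact scaling of the full-rank term (here an Adam-style square root of the second moment's mean) and the QHM mixing coefficient are not specified by the summary beyond "scaled by the inverse of the second momentum's mean."

```python
import numpy as np

def qhm_local_update(grad, P, m, v, lr, beta=0.9, nu=0.7, eps=1e-8):
    """One local step mixing a low-rank momentum direction with a
    full-rank gradient term, in the spirit of quasi-hyperbolic
    momentum. Illustrative sketch; hyperparameters are assumptions.

    grad: full-rank gradient, shape (d, n)
    P:    global projection basis, shape (d, r)
    m, v: low-rank first/second moments, shape (r, n)
    """
    g_low = P.T @ grad                       # project gradient to rank-r space
    m = beta * m + (1 - beta) * g_low        # low-rank first moment
    # Low-rank preconditioned direction, mapped back to full space.
    low_rank_step = P @ (m / (np.sqrt(v) + eps))
    # Full-rank exploration term: the raw gradient scaled using the
    # mean of the second moment, restoring directions outside span(P).
    full_rank_step = grad / (np.sqrt(v.mean()) + eps)
    # QHM-style convex combination of the two directions.
    update = nu * low_rank_step + (1 - nu) * full_rank_step
    return -lr * update, m
```

The key point is that `full_rank_step` is not confined to the column space of `P`, so the accumulated pseudo-gradient carries a full-rank signal and the recomputed projection can rotate toward new subspaces.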
The framework also integrates essential techniques from prior work: momentum rotation when a new projection matrix is computed (keeping momentum aligned with the current subspace) and local error feedback (improving optimization fidelity). Algorithmically, LoRDO operates on multiple synchronization timescales: model parameters are synchronized every K_x steps, the low-rank optimizer states (first and second momenta) every K_u and K_v steps respectively, and the global projection matrix is recomputed only at parameter-synchronization points, minimizing communication.
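The multi-period schedule can be sketched as a simple modular-arithmetic dispatch. This is a toy illustration of the timescale structure only; event names are hypothetical.

```python
def sync_schedule(total_steps, K_x, K_u, K_v):
    """Map each step to the communication events that fire at it:
    K_x -> parameter sync + projection recompute, K_u / K_v ->
    first / second low-rank moment sync. Illustrative sketch."""
    events = {}
    for t in range(1, total_steps + 1):
        fired = []
        if t % K_x == 0:
            # Projection is recomputed only when parameters sync.
            fired += ["sync_params", "recompute_projection"]
        if t % K_u == 0:
            fired.append("sync_first_moment")
        if t % K_v == 0:
            fired.append("sync_second_moment")
        if fired:
            events[t] = fired
    return events
```

Choosing K_u and K_v larger than K_x (or as multiples of it) is what lets optimizer-state traffic shrink faster than parameter traffic, which is where the bulk of the ≈10x communication reduction comes from.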
Empirical validation is conducted using peri-norm Transformer language models at scales of 16M, 125M, and 720M parameters. The results confirm the theoretical analysis: 1) Global projection alone causes stagnation, which is alleviated by full-rank QHM. 2) LoRDO achieves near-parity with low-rank DDP (perplexity gap <1%) on language modeling and downstream tasks. 3) Crucially, it does so while reducing communication volume by approximately 10x compared to low-rank DDP, by leveraging infrequent synchronization of both parameters and optimizer states. 4) Under severe memory constraints forcing very low ranks (e.g., 64) and small batch sizes, LoRDO not only matches but surpasses the performance of DDP by 3.36–4.7% in perplexity, demonstrating its particular utility in resource-constrained environments. 5) Ablation studies confirm the importance of momentum rotation and error feedback.
In summary, LoRDO successfully bridges the gap between communication-efficient distributed training and memory-efficient low-rank optimization. By combining a stable global projection with a mechanism for full-rank exploration, it maintains model performance while drastically reducing communication and memory overhead, offering a practical solution for scaling the training of large models under infrastructure constraints.