Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification

Error accumulation is effective for gradient sparsification in distributed settings: initially unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a scaling of the learning rate for the selected entries. Although this property prevents the slowdown of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning-rate scaling of error accumulation. The development of this scheme follows two major steps: first, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. Using the prior distribution inherited from Top-k, we derive a new sparsification algorithm which can be interpreted as a regularized form of Top-k. We call this algorithm regularized Top-k (RegTop-k). It utilizes past aggregated gradients to evaluate posterior statistics of the next aggregation. It then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through various numerical experiments. In distributed linear regression, it is observed that while Top-k remains at a fixed distance from the global optimum, RegTop-k converges to the global optimum at significantly higher compression ratios. We further demonstrate the generalization of this observation by employing RegTop-k in distributed training of ResNet-18 on CIFAR-10, as well as fine-tuning of multiple computer vision models on the ImageNette dataset. Our numerical results confirm that as the compression ratio increases, RegTop-k sparsification noticeably outperforms Top-k.


💡 Research Summary

This paper addresses a critical limitation of the widely used Top‑k gradient sparsification technique in distributed stochastic gradient descent (SGD). While Top‑k reduces communication by transmitting only the k largest entries of the accumulated gradient and uses error accumulation to eventually select previously omitted components, this mechanism implicitly scales the effective learning rate for the selected entries. In smooth loss landscapes this scaling can accelerate convergence, but in heterogeneous data settings or non‑smooth objectives it may cause excessive step sizes, leading to delayed progress, large optimality gaps, or even divergence.
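The baseline mechanism being critiqued can be sketched as follows. This is a minimal, generic rendering of Top‑k sparsification with error accumulation (error feedback) as commonly implemented in distributed SGD; the function name and array-based interface are illustrative, not taken from the paper.

```python
import numpy as np

def topk_with_error_feedback(grad, error, k):
    """One worker-side round of Top-k sparsification with error accumulation.

    `grad` is the fresh local gradient, `error` the residual left unsent
    in previous rounds; both are 1-D arrays of the model dimension.
    """
    accumulated = grad + error                           # add back unsent mass
    idx = np.argpartition(np.abs(accumulated), -k)[-k:]  # k largest |entries|
    sparse = np.zeros_like(accumulated)
    sparse[idx] = accumulated[idx]                       # transmit only these
    new_error = accumulated - sparse                     # keep the rest locally
    return sparse, new_error
```

Because the transmitted value is the *accumulated* entry rather than the fresh gradient, an entry selected after several skipped rounds carries several rounds' worth of gradient mass at once; this is the implicit learning-rate scaling discussed above.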

To overcome this issue, the authors reformulate gradient sparsification as an inverse‑probability (statistical inference) problem. Each worker’s sparsification mask is treated as a random variable; a prior distribution is derived from the Top‑k heuristic (probability proportional to the magnitude of accumulated gradients), and a likelihood model is constructed by describing the next aggregated gradient as a linear combination of past aggregations plus additive innovation. Using large‑deviation arguments for high‑dimensional settings, they obtain an asymptotic expression for the likelihood.
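Under the description above, the two ingredients might be written as follows; the symbols (mask $s$, accumulated gradient $a$, mixing coefficients $c_\tau$, innovation $\varepsilon_t$) are introduced here purely for illustration and need not match the paper's exact notation.

```latex
% Prior inherited from Top-k: mask entries favour large accumulated magnitudes
p(s) \;\propto\; \prod_{i=1}^{d} \lvert a_i \rvert^{\,s_i},
\qquad s \in \{0,1\}^d,\quad \textstyle\sum_{i} s_i = k .

% Likelihood model: the next aggregated gradient is a linear combination
% of past aggregates plus an additive innovation term
\bar{g}_{t} \;=\; \sum_{\tau=1}^{t-1} c_\tau\, \bar{g}_{\tau} \;+\; \varepsilon_t .
```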

Combining the prior and likelihood yields a posterior distribution over possible masks. The maximum‑a‑posteriori (MAP) estimator selects the k entries with the highest posterior probability of belonging to the true top‑k of the global gradient. This MAP rule can be expressed as the classic Top‑k mask plus a regularization term that penalizes large accumulated entries. The regularization strength λ directly controls the learning‑rate scaling introduced by error accumulation. When λ = 0 the method reduces to standard Top‑k; positive λ attenuates the scaling, preventing the overshoot observed in the motivating toy example.
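One plausible form of such a score, written to make the λ = 0 limit explicit, is a magnitude term minus a penalty on entries whose accumulated value overshoots the posterior statistic. Both the function name and this particular penalty are hypothetical sketches of the described behavior, not the paper's exact rule.

```python
import numpy as np

def regtopk_select(accumulated, posterior_stat, k, lam):
    """Return indices of the k entries with the highest regularized score.

    Hypothetical scoring: the Top-k magnitude score minus a penalty that
    grows with the deviation of the accumulated entry from the posterior
    statistic of the aggregate.  With lam = 0 this is plain Top-k.
    """
    scores = np.abs(accumulated) - lam * np.abs(accumulated - posterior_stat)
    return np.argpartition(scores, -k)[-k:]
```

With λ > 0, an entry whose accumulated value has drifted far from what the posterior predicts is demoted, which is exactly the attenuation of error-accumulation overshoot described above.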

The resulting algorithm, called Regularized Top‑k (RegTop‑k), operates as follows at each worker: (1) compute the local gradient and add the current error vector to obtain the accumulated gradient a; (2) using stored past global gradients, compute posterior statistics (essentially a weighted expectation of a); (3) select the k entries with the highest posterior scores; (4) transmit these entries and update the error vector. The computational overhead remains linear in the model dimension and requires only modest additional memory for the past aggregates.
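Steps (1)-(4) above can be sketched as a worker-side loop. This is a schematic reading, with the class name, the use of only the last aggregated gradient as the posterior statistic, and the specific penalized score all chosen here for illustration rather than taken from the paper.

```python
import numpy as np

class RegTopkWorker:
    """Sketch of one worker's RegTop-k round (details hypothetical)."""

    def __init__(self, dim, k, lam):
        self.error = np.zeros(dim)
        self.past_global = np.zeros(dim)  # stored past aggregated gradient
        self.k, self.lam = k, lam

    def step(self, grad):
        a = grad + self.error                    # (1) accumulate local gradient
        posterior = self.past_global             # (2) posterior statistic
        scores = np.abs(a) - self.lam * np.abs(a - posterior)
        idx = np.argpartition(scores, -self.k)[-self.k:]  # (3) select top-k
        sparse = np.zeros_like(a)
        sparse[idx] = a[idx]
        self.error = a - sparse                  # (4) update the error vector
        return sparse                            # entries sent to the server

    def observe_global(self, aggregated):
        self.past_global = aggregated            # keep for the next round
```

Consistent with the stated overhead, each round costs O(d) arithmetic plus the usual top-k selection, and the only extra memory is the stored past aggregate.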

The authors provide both theoretical and empirical evidence of RegTop‑k’s superiority. In a distributed linear regression setting, RegTop‑k converges linearly to the exact optimum, whereas Top‑k stalls at a fixed distance from the optimum. In deep learning experiments, RegTop‑k applied to ResNet‑18 on CIFAR‑10 with an extreme compression ratio of 0.1 % achieves up to 8 % higher classification accuracy than Top‑k. Further fine‑tuning experiments on the ImageNette benchmark with five modern architectures (SqueezeNet, ShuffleNetV2, MobileNetV2, EfficientNet, ResNet‑152) confirm statistically significant gains across compression ratios ranging from 0.5 % to 2 %. Repeated trials (10 runs per setting) yield p‑values below 0.01, demonstrating that the improvements are not due to random variation.

In summary, the paper introduces a Bayesian optimality perspective to gradient sparsification, derives a regularized Top‑k mask that explicitly controls learning‑rate scaling, and validates the approach with extensive experiments. The framework opens avenues for adaptive regularization, extensions to other compression schemes (quantization, sketching), and applications in asynchronous or multi‑task distributed training.

