A universal compression theory for lottery ticket hypothesis and neural scaling laws
When training large-scale models, performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error, which is proved to be the optimal compression rate. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. Implication (Ia) directly establishes a proof of the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L \sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
💡 Research Summary
The paper introduces a “Universal Compression Theorem” that rigorously shows how any permutation‑invariant function of d objects can be compressed to depend on only polylog(d) objects with vanishing error. The authors model both neural network parameters and training data as collections of “objects” (e.g., weight vectors or data points) that are symmetric under permutation. Under a mild smoothness assumption—namely that the function admits a deep‑set representation f(θ) = h(∑_i g(w_i)) with h and g Taylor‑expandable—they invoke a multivariate version of the Fundamental Theorem of Symmetric Polynomials (FTSP) to express f in terms of the first k moments p_k = (1/d)∑_i w_i^{⊗k}. By Tchakaloff’s theorem, matching these moments requires at most N_{m,k} = C(m+k, k) weighted objects, where m is the ambient dimension of each object.
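The moment representation above is easy to make concrete. The sketch below (illustrative code, not the paper's implementation) computes the flattened tensor-power moments p_1, …, p_k of a set of m-dimensional objects and checks that they are invariant under permutation of the objects:

```python
import numpy as np

def moments(W, k):
    """Power-sum moments p_1..p_k of a set of m-dim objects W (shape d x m).

    p_j = (1/d) * sum_i w_i^{(x)j}; each tensor power is flattened for the demo.
    """
    out = []
    for j in range(1, k + 1):
        t = W
        # Build w_i^{(x)j} by repeated outer products, keeping the object axis i.
        for _ in range(j - 1):
            t = np.einsum('i...,im->i...m', t, W)
        out.append(t.mean(axis=0).ravel())   # average over the d objects
    return np.concatenate(out)

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 3))       # d = 50 objects in m = 3 dimensions
perm = rng.permutation(50)
# Permutation invariance: the moment vector ignores the ordering of objects.
assert np.allclose(moments(W, 4), moments(W[perm], 4))
```

Any permutation-invariant f that depends on W only through these moments is therefore determined by the compressed representation as long as the moments are preserved.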
The core technical contribution is a constructive compression algorithm (Algorithm 1) that repeatedly (i) finds a dense cluster of objects whose diameter scales as O((|supp|)^{-1/m}) and (ii) replaces the cluster by at most N_{m,k} weighted representatives while preserving the first k moments. The error analysis (Theorem 3) shows that if the original objects lie within a ball of radius r, the compression error decays as O(d r^{k+1}), so by choosing k large enough (still much smaller than d) the error can be driven to zero while the number of remaining objects shrinks to polylog(d).
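A minimal one-dimensional instance of the moment-matching step (ii) can be sketched as follows. For m = 1, matching the first k moments needs at most k + 1 weighted nodes; the quantile-based node choice below is a simple heuristic stand-in for Tchakaloff's theorem, which guarantees existence (with nonnegative weights, which this plain Vandermonde solve does not enforce):

```python
import numpy as np

def compress_cluster_1d(x, k):
    """Replace d scalar objects by k+1 weighted nodes matching moments p_0..p_k.

    Nodes are placed at quantiles of the cluster (a heuristic choice); the
    weights then solve a (k+1) x (k+1) Vandermonde system exactly.
    """
    emp = np.array([np.mean(x**j) for j in range(k + 1)])    # p_0..p_k, p_0 = 1
    nodes = np.quantile(x, np.linspace(0.0, 1.0, k + 1))
    V = np.vander(nodes, k + 1, increasing=True).T            # V[j, a] = nodes[a]**j
    weights = np.linalg.solve(V, emp)                         # match all k+1 moments
    return nodes, weights

rng = np.random.default_rng(1)
x = rng.normal(size=1000)                  # a "cluster" of 1000 scalar objects
nodes, w = compress_cluster_1d(x, 5)
# 1000 objects -> 6 weighted objects with the first 5 moments preserved.
for j in range(6):
    assert np.isclose(w @ nodes**j, np.mean(x**j))
```

Algorithm 1 applies this kind of replacement cluster by cluster, which is how the total object count is driven down to polylog(d) while the moments, and hence f, are preserved up to the Theorem 3 error bound.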
Two major implications are explored.
- Dynamical Lottery Ticket Hypothesis (DLTH). By treating each neuron (or pair of input‑output weight vectors) as an object, the compression preserves all moments that determine the loss and its gradients. Consequently, the compressed network has exactly the same loss landscape and gradient flow as the original, meaning that training dynamics are unchanged. This provides a formal proof of a stronger version of the lottery‑ticket hypothesis: not only does a subnetwork exist that can achieve the same final performance, but it can be trained in exactly the same way as the full network.
- Accelerated Neural Scaling Laws. When the objects are data points, moment‑preserving compression yields a dramatically smaller weighted dataset that reproduces the original empirical risk. Substituting the compressed size N′ = polylog(N) into the classic scaling law L ∝ N^{−α} (with α ≈ 0.1–0.3) leads to an effective exponent that can be made arbitrarily large, even achieving exponential or super‑exponential decay of loss, e.g., L ≈ exp(−α′ N^{1/m}). This suggests that the apparent data‑inefficiency of large language models is not a fundamental limitation but a consequence of not exploiting permutation symmetry.
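The neuron-as-object view behind the DLTH can be illustrated in its easiest case, exact duplicates: merging repeated neurons into weighted ones leaves the network function, and hence its loss landscape and gradient flow, exactly unchanged. (The paper's moment-matching argument extends this to approximate clusters; everything below is an illustrative sketch, not the paper's code.)

```python
import numpy as np

def mlp(x, W, a, counts=None):
    """One-hidden-layer ReLU net: f(x) = sum_i c_i * a_i * relu(w_i . x).

    The multiplicities c_i play the role of compression weights
    (all ones for an ordinary, uncompressed network).
    """
    c = np.ones(len(a)) if counts is None else counts
    return (c * a * np.maximum(W @ x, 0.0)).sum()

rng = np.random.default_rng(2)
W_small = rng.normal(size=(8, 4))       # 8 distinct neurons (objects)...
a_small = rng.normal(size=8)
idx = rng.integers(0, 8, size=512)      # ...replicated into a width-512 net
W_big, a_big = W_small[idx], a_small[idx]

counts = np.bincount(idx, minlength=8).astype(float)
x = rng.normal(size=4)
# The weighted 8-neuron net computes exactly what the 512-neuron net does,
# at every input x, so losses and gradients (hence training) coincide too.
assert np.isclose(mlp(x, W_big, a_big), mlp(x, W_small, a_small, counts))
```

In the general case the compressed neurons are not exact copies; matching the first k moments of the neuron objects keeps the two networks' losses and gradients equal up to the O(d r^{k+1}) error of Theorem 3.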
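The change of variables behind the boosted scaling law is elementary and can be checked numerically. If n = C (log d)^m objects suffice to reproduce the risk of d data points, then L ~ d^{−α} in d becomes a stretched-exponential law in the compressed size n. The constants α, m, C below are illustrative, not values from the paper:

```python
import numpy as np

alpha, m, C = 0.2, 2, 1.0     # illustrative constants (not from the paper)

def loss_original(d):
    return d ** -alpha                    # classic power law  L ~ d^-alpha

def loss_vs_compressed(n):
    # Invert n = C * (log d)^m  =>  d = exp((n / C)**(1/m)); substituting
    # gives L = exp(-alpha * (n / C)**(1/m)), the exp(-a' * n^(1/m)) form.
    d = np.exp((n / C) ** (1.0 / m))
    return loss_original(d)

# The two parameterizations agree point by point on the compressed sizes.
for d in [1e3, 1e6, 1e9]:
    n = C * np.log(d) ** m
    assert np.isclose(loss_vs_compressed(n), loss_original(d))
```

Read as a function of stored objects n rather than raw data d, the same model thus follows an exponentially fast law, which is the sense in which the power law is "boosted."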
The authors validate the theory on synthetic and real architectures, including ReLU multilayer perceptrons and Transformer‑like models. Empirically, they achieve compression ratios from d to polylog(d) with negligible loss in test accuracy, and the training curves of compressed and original models are virtually indistinguishable. Computationally, moment matching costs O(d N_{m,k}^2), which grows quickly with k, but practical experiments use modest k (5–10) and k‑means clustering to keep runtime reasonable.
In summary, the paper provides a mathematically rigorous framework that leverages permutation symmetry to compress both models and datasets dramatically, preserving exact learning dynamics. This unifies and extends prior work on lottery‑ticket pruning and neural scaling, offering a new pathway toward more data‑efficient and compute‑efficient AI systems. Future work may address non‑symmetric architectures, non‑smooth activations, and online or adaptive compression schemes.