Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality
Existing distribution compression methods reduce the number of observations in a dataset by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which we introduce to quantify the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that BDC can achieve comparable or superior downstream task performance to ambient-space compression at substantially lower cost and with significantly higher rates of compression.
💡 Research Summary
The paper introduces Bilateral Distribution Compression (BDC), a novel framework that simultaneously reduces the number of samples and the dimensionality of large‑scale datasets while preserving their underlying probability distribution. Traditional distribution‑compression methods focus solely on minimizing the Maximum Mean Discrepancy (MMD) between a small synthetic set and the full dataset, thereby addressing only the sample‑size dimension. Dimensionality reduction, on the other hand, is typically handled by separate techniques such as PCA, t‑SNE, UMAP, or autoencoders, which do not produce a compressed set of observations. BDC bridges this gap by defining a new discrepancy measure, Decoded MMD (DMMD), which evaluates the MMD between the original data and the reconstruction of a compressed latent set after decoding back to the ambient space.
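In code, the DMMD is simply the plain MMD between the original data and whatever a decoder produces from the latent compressed set. The sketch below uses a Gaussian kernel with a biased V‑statistic estimator; the function names and the toy linear decoder are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2(A, B, sigma=1.0):
    # Biased (V-statistic) estimate of squared MMD between two empirical samples.
    return (gaussian_kernel(A, A, sigma).mean()
            + gaussian_kernel(B, B, sigma).mean()
            - 2.0 * gaussian_kernel(A, B, sigma).mean())

def dmmd2(D, C_latent, decoder, sigma=1.0):
    # DMMD: discrepancy between the original data D and the *decoded*
    # compressed set phi(C); `decoder` is any map from latent to ambient space.
    return mmd2(D, decoder(C_latent), sigma)

# Toy usage with a hypothetical linear decoder.
rng = np.random.default_rng(0)
D = rng.normal(size=(200, 8))                 # n = 200 points in d = 8
C = rng.normal(size=(20, 3))                  # m = 20 latent points in p = 3
phi = lambda Z: Z @ rng.normal(size=(3, 8))   # illustrative decoder R^3 -> R^8
print(dmmd2(D, C, phi))
```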
Directly minimizing DMMD is intractable because the decoder and the compressed set are tightly coupled, leading to a highly non‑convex optimization problem. To overcome this, the authors propose a two‑stage optimization:
1. Latent‑space learning via Reconstruction MMD (RMMD). An autoencoder consisting of an encoder ψ : ℝᵈ → ℝᵖ and a decoder φ : ℝᵖ → ℝᵈ is trained to minimize RMMD, the MMD between the original dataset and its reconstruction φ(ψ(x)). Unlike the classic mean‑squared reconstruction error (MSRE) used in PCA, RMMD measures the discrepancy between entire distributions, thereby preserving higher‑order moments. The authors prove (Theorem 3.1) that with a quadratic kernel k(x,y) = (1 + xᵀy)², minimizing RMMD recovers the same subspace as PCA, establishing a clear link between the two objectives.
2. Latent‑space compression via Encoded MMD (EMMD). After fixing the encoder ψ learned in stage 1, a small set of latent points C = {z_j}₁ᵐ ⊂ ℝᵖ (with m ≪ n) is optimized to minimize EMMD, the MMD between the empirical distribution of ψ(D) and that of C, using a kernel h on the latent space (typically Gaussian). This step can be interpreted as a discretized Wasserstein gradient flow and enjoys global convergence guarantees under mild convexity assumptions.
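The two stages can be illustrated end‑to‑end on toy data. Exploiting the Theorem 3.1 connection, stage 1 below simply takes the top‑p PCA directions as a linear stand‑in for RMMD‑trained encoding, and stage 2 runs plain gradient descent on the biased EMMD estimate with a Gaussian latent kernel. The concrete choices here (PCA stand‑in, median‑heuristic bandwidth, backtracking step size) are assumptions for the sketch, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, sigma):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2(A, B, sigma):
    return (gaussian_kernel(A, A, sigma).mean()
            + gaussian_kernel(B, B, sigma).mean()
            - 2.0 * gaussian_kernel(A, B, sigma).mean())

def emmd_grad(C, Z, sigma):
    # Gradient of the biased squared-EMMD estimate with respect to the points in C.
    m, n = len(C), len(Z)
    Kcc = gaussian_kernel(C, C, sigma)
    Kcz = gaussian_kernel(C, Z, sigma)
    g = (-2.0 / (m**2 * sigma**2)) * (Kcc[:, :, None] * (C[:, None, :] - C[None, :, :])).sum(1)
    g += (2.0 / (m * n * sigma**2)) * (Kcz[:, :, None] * (C[:, None, :] - Z[None, :, :])).sum(1)
    return g

# Toy data: n = 300 points in d = 20 dimensions with p = 2 dimensional structure.
n, d, p, m = 300, 20, 2, 10
D = rng.normal(size=(n, p)) @ rng.normal(size=(p, d)) + 0.05 * rng.normal(size=(n, d))

# Stage 1 stand-in: PCA yields the linear encoder that Theorem 3.1 associates
# with RMMD minimisation under a quadratic kernel.
mu = D.mean(0)
_, _, Vt = np.linalg.svd(D - mu, full_matrices=False)
psi = lambda X: (X - mu) @ Vt[:p].T           # encoder R^d -> R^p

# Stage 2: descend the EMMD between psi(D) and a small latent set C.
Z = psi(D)
sq = ((Z[:, None, :] - Z[None, :, :])**2).sum(-1)
sigma = np.sqrt(np.median(sq[sq > 0]) / 2.0)  # median-heuristic bandwidth
C = Z[rng.choice(n, m, replace=False)].copy() # initialise from encoded data
lr, val = sigma**2, mmd2(Z, C, sigma)
emmd_before = val
for _ in range(100):
    cand = C - lr * emmd_grad(C, Z, sigma)
    cand_val = mmd2(Z, cand, sigma)
    if cand_val < val:        # accept only descent steps (simple backtracking)
        C, val = cand, cand_val
    else:
        lr *= 0.5
print(emmd_before, val)       # EMMD drops as C converges on psi(D)'s distribution
```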
The authors establish rigorous theoretical connections: Theorem 3.3 shows that if RMMD and EMMD both converge to zero, then DMMD also converges to zero, guaranteeing that the two‑stage procedure indeed minimizes the original objective. Moreover, Theorem 3.5 provides a bound DMMD ≤ RMMD + EMMD for any choice of decoder and latent‑space kernel, offering a practical way to control the final discrepancy.
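The DMMD ≤ RMMD + EMMD bound has the shape of a triangle inequality, since MMD is a metric on distributions: routing the comparison between the data and the decoded compressed set through the intermediate reconstruction (φ∘ψ)(D) splits the discrepancy into a reconstruction term and a compression term. A quick numerical check of that mechanism, with arbitrary stand‑in samples for the three distributions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd(A, B, sigma=1.0):
    # (Unsquared) MMD; clip tiny negative values from floating-point rounding.
    m2 = (gaussian_kernel(A, A, sigma).mean()
          + gaussian_kernel(B, B, sigma).mean()
          - 2.0 * gaussian_kernel(A, B, sigma).mean())
    return np.sqrt(max(m2, 0.0))

rng = np.random.default_rng(1)
P = rng.normal(size=(300, 5))              # stands in for the data distribution
Q = P + 0.3 * rng.normal(size=(300, 5))    # stands in for the reconstruction
R = rng.normal(loc=1.0, size=(40, 5))      # stands in for the decoded compressed set

lhs, rhs = mmd(P, R), mmd(P, Q) + mmd(Q, R)
assert lhs <= rhs + 1e-9   # triangle inequality: the mechanism behind the bound
```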
To prevent the decoder from simply memorizing the data (a risk when the decoder is highly expressive), the paper discusses regularization strategies for both linear and nonlinear autoencoders. Linear autoencoders are constrained to orthonormal encoders (Stiefel manifold optimization), ensuring that the latent space is a genuine p‑dimensional subspace. Nonlinear autoencoders employ bottleneck architectures, weight tying, and decoder capacity limits, all of which encourage the latent representation to capture meaningful structure rather than act as an identity map.
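For the linear case, constraining the encoder to the Stiefel manifold can be implemented by retracting the weight matrix back to orthonormal columns after each gradient update. The polar‑decomposition projection below is one standard retraction, shown here as an assumed example rather than the paper's specific optimizer.

```python
import numpy as np

def stiefel_retract(W):
    # Project onto the Stiefel manifold: the polar factor U V^T from the SVD
    # W = U S V^T is the nearest matrix with orthonormal columns.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(2)
d, p = 20, 3
W = rng.normal(size=(d, p))   # unconstrained encoder weights R^d -> R^p
W = stiefel_retract(W)        # after each gradient step, retract back

assert np.allclose(W.T @ W, np.eye(p))   # columns are now orthonormal
```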
Computationally, the entire BDC pipeline scales linearly in both the number of samples n and the ambient dimension d (O(nd) time and memory), a substantial improvement over quadratic‑time kernel methods.
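One standard route to linear‑time MMD estimates, offered here as an illustrative assumption rather than necessarily the paper's exact mechanism, is to replace the Gaussian kernel with random Fourier features, so the squared MMD collapses to a Euclidean distance between feature means and no n × n kernel matrix is ever formed:

```python
import numpy as np

def rff_mmd2(X, Y, sigma=1.0, n_features=512, seed=0):
    # Random Fourier features z(x) = sqrt(2/R) cos(W^T x + b) approximate the
    # Gaussian kernel, so squared MMD reduces to the squared distance between
    # mean feature vectors: O((n + m) * d * R) time and memory.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    feat = lambda A: np.sqrt(2.0 / n_features) * np.cos(A @ W + b)
    diff = feat(X).mean(0) - feat(Y).mean(0)
    return float(diff @ diff)
```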
Empirically, the authors evaluate BDC on image datasets (CIFAR‑10, a subsampled ImageNet), text classification (AGNews), and synthetic high‑dimensional data. Across tasks, BDC achieves compression ratios ranging from 10× to 100× while matching or surpassing the downstream performance (classification accuracy, regression R²) of state‑of‑the‑art ambient‑space compression methods such as kernel herding, kernel thinning, and the recent M3D approach. Notably, BDC does so with far lower computational cost and memory footprint, making it suitable for very large datasets where existing methods become prohibitive.
The paper’s contributions can be summarized as follows:
- Formal definition of bilateral distribution compression and the DMMD metric.
- Introduction of RMMD and EMMD as tractable surrogates that together guarantee DMMD minimization.
- Theoretical analysis linking RMMD to PCA and providing explicit bounds on DMMD.
- Extension to supervised settings via label‑aware variants of DMMD, RMMD, and EMMD.
- Comprehensive experiments demonstrating superior compression efficiency and downstream task performance.
In conclusion, Bilateral Distribution Compression offers a principled, scalable, and empirically validated solution for reducing both the sample size and dimensionality of massive datasets while faithfully preserving their statistical properties. This advances the state of the art in data condensation, enabling more sustainable and cost‑effective training of large‑scale machine learning models.