VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental to modern generative modeling, yet they often suffer from training instability and “codebook collapse” due to the inherent coupling of representation learning and discrete codebook optimization. In this paper, we propose VP-VAE (Vector Perturbation VAE), a novel paradigm that decouples representation learning from discretization by eliminating the need for an explicit codebook during training. Our key insight is that, from the neural network’s viewpoint, quantization manifests primarily as the injection of a structured perturbation in latent space. Accordingly, VP-VAE replaces the non-differentiable quantizer with distribution-consistent and scale-adaptive latent perturbations generated via Metropolis–Hastings sampling. This design enables stable training without a codebook while making the model robust to inference-time quantization error. Moreover, under the assumption of approximately uniform latent variables, we derive FSP (Finite Scalar Perturbation), a lightweight variant of VP-VAE that provides a unified theoretical explanation and a practical improvement for FSQ-style fixed quantizers. Extensive experiments on image and audio benchmarks demonstrate that VP-VAE and FSP improve reconstruction fidelity and achieve substantially more balanced token usage, while avoiding the instability inherent to coupled codebook training.
💡 Research Summary
Vector Quantized Variational Autoencoders (VQ‑VAEs) have become a cornerstone for learning discrete latent representations that can be fed into sequence models such as Transformers. Despite their success, VQ‑VAEs suffer from two intertwined problems: (1) training instability caused by the simultaneous optimization of the encoder‑decoder and the codebook, and (2) “codebook collapse,” where only a small subset of code vectors receive updates while the rest become dead. Existing remedies either add complex regularizers and heuristics to the coupled training loop or replace the learnable codebook with a fixed grid (e.g., FSQ, LFQ). The former still retains the coupling, while the latter forces the latent distribution to conform to a rigid geometry, often wasting representational capacity.
The paper introduces VP‑VAE (Vector Perturbation VAE), a paradigm that completely decouples representation learning from discretization. The authors reinterpret quantization as the injection of a structured perturbation into the latent space. During training, instead of performing a non‑differentiable nearest‑neighbor lookup, they apply a perturbation operator T(z; S) to each latent vector z, producing a perturbed vector z̃. The perturbation is generated by a Metropolis–Hastings (MH) sampler that uses a memory buffer S of recent latent vectors to estimate the empirical density. Two essential properties are enforced: (i) scale alignment – the perturbation magnitude matches the expected quantization error for a target codebook size K, and (ii) distribution consistency – perturbed vectors stay within high‑density regions of the original latent distribution.
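As a rough illustration (not the authors' implementation), the perturbation operator T(z; S) can be sketched as a short Metropolis–Hastings chain whose proposals are Gaussian steps of a given scale and whose acceptance test uses a Gaussian kernel density estimate over the buffer S. The names `kde_density` and `mh_perturb`, the bandwidth, and the chain length are all assumptions of this sketch:

```python
import numpy as np

def kde_density(x, buffer, bandwidth=0.5):
    """Non-parametric (Gaussian-kernel) density estimate of x from buffer samples."""
    sq = np.sum((buffer - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * bandwidth ** 2)))

def mh_perturb(z, buffer, radius, n_steps=10, rng=None):
    """One structured perturbation of latent z via a short Metropolis-Hastings chain.

    Proposals are Gaussian steps of scale `radius`; the acceptance test keeps the
    perturbed vector inside high-density regions of the empirical latent
    distribution estimated from `buffer` (distribution consistency).
    """
    rng = rng or np.random.default_rng(0)
    current, p_current = z.copy(), kde_density(z, buffer)
    for _ in range(n_steps):
        proposal = current + rng.normal(scale=radius, size=z.shape)
        p_prop = kde_density(proposal, buffer)
        # symmetric proposal -> acceptance ratio is a plain density ratio
        if rng.random() < min(1.0, p_prop / (p_current + 1e-12)):
            current, p_current = proposal, p_prop
    return current

# toy usage: 2-D latents, buffer of recent encoder outputs
rng = np.random.default_rng(42)
S = rng.normal(size=(256, 2))          # memory buffer of recent latent vectors
z = np.array([0.3, -0.2])
z_tilde = mh_perturb(z, S, radius=0.1, rng=rng)
```

In the full method the proposal scale would come from the data-adaptive radius R(z) described next, rather than a fixed constant.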
Scale alignment is achieved by measuring the distance to the M‑th nearest neighbor in S, where M = |S| / K. The radius R(z) = η·D_M(z|S) (η is a small hyper‑parameter) serves as a data‑adaptive estimate of the typical Voronoi cell size for a K‑codebook. This automatically adapts to local density variations and to the desired bitrate. Distribution consistency is ensured by the MH acceptance step, which only accepts proposals that remain likely under the non‑parametric density estimate (implemented via kernel density estimation). Consequently, the perturbation does not push latent vectors into low‑probability regions, preserving the decoder’s capacity to model realistic latents.
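The scale-alignment rule can be written down directly. The sketch below (with the hypothetical function name `adaptive_radius`) computes R(z) = η·D_M(z|S) with M = |S| / K by brute-force nearest-neighbor search over the buffer:

```python
import numpy as np

def adaptive_radius(z, buffer, K, eta=0.1):
    """Scale-aligned perturbation radius R(z) = eta * D_M(z | S).

    M = |S| / K approximates how many buffer samples fall into one Voronoi
    cell of a K-entry codebook, so the distance to the M-th nearest neighbor
    of z tracks the local cell size -- and hence the expected quantization
    error -- at z, for the target bitrate implied by K.
    """
    M = max(1, len(buffer) // K)
    dists = np.sort(np.linalg.norm(buffer - z, axis=1))
    return eta * dists[M - 1]

rng = np.random.default_rng(0)
S = rng.normal(size=(1024, 4))
z = np.zeros(4)
r_coarse = adaptive_radius(z, S, K=16)    # coarse codebook -> larger cells
r_fine = adaptive_radius(z, S, K=512)     # fine codebook -> smaller cells
```

Note how the same rule automatically yields a smaller radius for a larger target codebook, and a larger radius in sparse regions of the latent distribution.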
Because reliable density estimation in high‑dimensional spaces is difficult, VP‑VAE introduces a low‑dimensional bottleneck d ≤ 16. The encoder first projects token embeddings h_t ∈ ℝ^C to z ∈ ℝ^d via a learnable down‑projection P↓, applies the perturbation, and then up‑projects back with P↑ before feeding the result to the decoder. This design keeps the perturbation tractable while retaining enough expressive power for high‑fidelity reconstruction.
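A minimal sketch of the bottleneck, with random matrices standing in for the learnable projections P↓ and P↑ (in the actual model these would be trained jointly with the encoder and decoder, and the perturbation would be the MH operator):

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 256, 8                              # embedding width C, bottleneck dim d <= 16

# stand-ins for the learnable projections (random for illustration only)
P_down = rng.normal(scale=1.0 / np.sqrt(C), size=(C, d))
P_up = rng.normal(scale=1.0 / np.sqrt(d), size=(d, C))

def bottleneck_forward(h, perturb):
    """Down-project h to d dims, perturb in the low-dim space, project back up."""
    z = h @ P_down                         # density estimation stays tractable in d dims
    z_tilde = perturb(z)                   # structured perturbation (MH step in VP-VAE)
    return z_tilde @ P_up                  # restore width C for the decoder

h = rng.normal(size=(C,))
out = bottleneck_forward(h, lambda z: z + rng.normal(scale=0.05, size=z.shape))
```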
After training converges, an explicit codebook is constructed from the buffer S using K‑means or a Voronoi clustering step, and is used only at inference time. Since the codebook never participated in training, the collapse phenomenon disappears entirely.
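The post-training codebook construction might look like the following sketch, which runs a plain Lloyd's-algorithm K-means over the buffer and then quantizes by nearest-neighbor lookup at inference; all names and the iteration count are illustrative:

```python
import numpy as np

def kmeans_codebook(buffer, K, n_iters=20, seed=0):
    """Build a K-entry codebook from the latent buffer with Lloyd's algorithm.

    Run once after training converges; the codebook is used only at inference,
    so it never interacts with gradient-based optimization.
    """
    rng = np.random.default_rng(seed)
    codebook = buffer[rng.choice(len(buffer), size=K, replace=False)]
    for _ in range(n_iters):
        # assign each buffer vector to its nearest code
        d2 = ((buffer[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # move each code to the mean of its assigned vectors
        for k in range(K):
            members = buffer[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(z, codebook):
    """Inference-time quantization: index and value of the nearest codebook entry."""
    idx = ((codebook - z) ** 2).sum(axis=1).argmin()
    return idx, codebook[idx]

rng = np.random.default_rng(1)
S = rng.normal(size=(512, 4))              # buffer of latents at convergence
cb = kmeans_codebook(S, K=32)
idx, code = quantize(S[0], cb)
```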
Under the simplifying assumption that each latent dimension is approximately uniformly distributed, the authors derive a lightweight variant called Finite Scalar Perturbation (FSP). FSP treats each dimension independently, applies centered scalar perturbations, and quantizes by mapping to interval centroids. The method aligns with the Lloyd‑Max optimal scalar quantizer, providing a theoretical justification for the empirical gains over fixed‑grid quantizers such as FSQ. FSP is computationally cheaper and avoids the inefficiencies of rigid grids when the true latent distribution deviates from uniformity.
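Under a uniform-on-[-1, 1] assumption, the L-level Lloyd-Max scalar quantizer has uniform intervals with midpoint centroids, which suggests the following per-dimension sketch. The interval width Δ = 2/L, the level count, and the function names are assumptions of this illustration, not values stated in the paper summary:

```python
import numpy as np

def fsp_train_perturb(z, L, rng):
    """Training-time FSP: centered per-dimension perturbation whose width matches
    one quantization interval of an L-level uniform quantizer on [-1, 1]."""
    delta = 2.0 / L
    return z + rng.uniform(-delta / 2, delta / 2, size=z.shape)

def fsp_quantize(z, L):
    """Inference-time FSP: map each dimension to its interval centroid.

    For a uniform latent on [-1, 1], uniform intervals with midpoint centroids
    coincide with the Lloyd-Max optimal scalar quantizer.
    """
    delta = 2.0 / L
    idx = np.clip(np.floor((z + 1.0) / delta), 0, L - 1)
    return -1.0 + (idx + 0.5) * delta

rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, size=(8,))
z_train = fsp_train_perturb(z, L=5, rng=rng)   # used during training
z_hat = fsp_quantize(z, L=5)                   # used at inference
```

Per dimension, the quantization error is bounded by Δ/2, the same scale as the training-time perturbation, which is exactly the scale-alignment property specialized to the scalar case.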
Extensive experiments on image (CIFAR‑10, ImageNet‑64) and audio (VCTK) benchmarks demonstrate that VP‑VAE improves PSNR by roughly 1 dB over standard VQ‑VAE with straight‑through estimation (STE), with corresponding SSIM gains, while achieving >95 % token utilization and eliminating dead codes. FSP outperforms FSQ by 0.8 dB in PSNR and shows more balanced token usage. Training curves are smoother, and the method is robust to the choice of η and K, indicating reduced hyper‑parameter sensitivity.
In summary, VP‑VAE offers a novel “quantization‑as‑perturbation” perspective that removes the need for a learnable codebook during representation learning, thereby solving the long‑standing instability and collapse issues of VQ‑VAEs. The Metropolis–Hastings based adaptive perturbation, automatic scale calibration, and low‑dimensional bottleneck together form a coherent framework that can be readily integrated into existing VAE pipelines. The FSP extension further bridges the gap between theory and practice for fixed‑grid quantizers. Future work may explore more efficient sampling schemes, extensions to non‑uniform latent distributions, and applications to large‑scale multimodal models where token efficiency is critical.