iFSQ: Improving FSQ for Image Generation with 1 Line of Code
The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization can cause activation collapse. This mismatch forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in the original FSQ with a distribution-matching mapping that enforces a uniform prior. Termed iFSQ, this simple strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) The optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension. (2) Under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA. Code is available at https://github.com/Tencent-Hunyuan/iFSQ
💡 Research Summary
The paper addresses a fundamental obstacle in the field of image generation: the inability to fairly compare autoregressive (AR) models that operate on discrete tokens with diffusion models that work on continuous latents. This difficulty stems from the different tokenizers each paradigm relies on—Vector‑Quantized VAEs (VQ‑VAEs) for AR and standard VAEs for diffusion—creating a split that confounds performance attribution. Finite Scalar Quantization (FSQ) has been proposed as a theoretical bridge because it replaces a learnable codebook with simple rounding, yielding both discrete indices and continuous values. However, the authors identify a critical flaw in vanilla FSQ: it uses a tanh activation to bound latent values before quantization, which, when applied to the typically Gaussian‑distributed activations of vision networks, leads to “activation collapse.” Most data fall into a few central bins, resulting in high reconstruction fidelity but low information efficiency (≈83 % bin utilization). Conversely, forcing equal‑probability bins sacrifices reconstruction quality.
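The activation-collapse effect described above can be sketched numerically. The following is a minimal numpy illustration, not the paper's implementation: the number of levels `L = 8` and the latent scale (a narrow Gaussian with standard deviation 0.5, standing in for typical vision-network activations) are illustrative assumptions. It bounds Gaussian latents with `tanh`, rounds them into `L` equal-interval bins, and measures bin utilization as the empirical entropy relative to the `log2(L)` maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
# Narrow Gaussian latents; the 0.5 scale is an illustrative assumption
x = 0.5 * rng.standard_normal(100_000)

L = 8                                               # levels per dimension (illustrative)
y = np.tanh(x)                                      # vanilla FSQ: bound latents to (-1, 1)
idx = np.round((y + 1) / 2 * (L - 1)).astype(int)   # equal-interval rounding into L bins

probs = np.bincount(idx, minlength=L) / len(x)
# Entropy relative to log2(L) measures how evenly the bins are used
entropy = -(probs[probs > 0] * np.log2(probs[probs > 0])).sum()
print(f"bin probabilities: {np.round(probs, 3)}")
print(f"utilization: {entropy / np.log2(L):.2f}")
```

Under these assumptions the central bins absorb most of the mass while the edge bins are nearly empty, reproducing the sub-optimal utilization the authors report for vanilla FSQ.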
To resolve this, the authors replace the tanh with a distribution‑matching mapping:
y = 2 · sigmoid(1.6 · x) − 1
This single‑line change transforms the unbounded Gaussian latent distribution into an approximately uniform distribution over