Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models thanks to their ability to generate tokens in parallel, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens within a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation, which establishes a global semantic scaffold, followed by detail reconstruction, which efficiently completes the remaining tokens. On the assumption that it is harder to create an image from scratch than to complete one given a basic structural framework, GtR achieves acceleration by computing the reconstruction stage quickly while preserving quality by computing the generation stage slowly. Moreover, observing that tokens in the detailed regions of an image often carry more semantic information than tokens in salient regions, we further propose Frequency-Weighted Token Selection (FTS), which allocates more of the computation budget to tokens in image details, localized via the energy of their high-frequency components. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate a 3.72× speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. the original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our code will be released at https://github.com/feihongyan1/GtR.


💡 Research Summary

Masked Autoregressive (MAR) models have emerged as a promising alternative to fully sequential autoregressive (AR) generators by allowing parallel prediction of multiple tokens in a single forward pass. Despite this advantage, MAR’s speedup is limited because predicting many spatially correlated visual tokens simultaneously forces the model to learn a highly complex joint distribution, which quickly degrades generation quality. This paper introduces a training‑free, hierarchical sampling strategy called Generation‑then‑Reconstruction (GtR) that tackles this fundamental bottleneck.

The core insight is two‑fold. First, an image’s global semantic scaffold can be established with a relatively small set of non‑adjacent tokens. Empirical analysis shows that when 50 % of tokens are sampled in a checkerboard pattern, the remaining half can be regenerated with different random seeds while producing almost identical images; the global structure is already decided. Second, high‑frequency components in the latent space correspond to fine‑grained details and are harder to generate than low‑frequency background regions. Therefore, allocating more computational budget to high‑frequency tokens should improve quality without sacrificing speed.

GtR operationalizes these observations in two stages. In the generation stage, tokens satisfying (i + j) mod 2 = 0 (i.e., a checkerboard pattern) are sampled first. The number of tokens predicted in parallel per masked-autoregressive step is deliberately kept low, so this stage proceeds slowly but ensures that generated tokens are spatially dispersed, providing a strong, globally coherent conditioning signal. In the reconstruction stage, the remaining tokens ((i + j) mod 2 = 1) are filled in at a very high parallel ratio, often in a single masked-autoregressive step or at most two, because each of these tokens is now surrounded by already-generated neighbors, making the task akin to "detail completion" rather than full creation.
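The checkerboard split behind the two stages can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `checkerboard_partition` is a hypothetical helper, and the 16×16 grid size assumes MAR's typical latent resolution.

```python
import numpy as np

def checkerboard_partition(h, w):
    """Split an h x w token grid into the two GtR stages by the
    parity of (i + j): even parity is sampled first (structure),
    odd parity is filled in afterwards (detail reconstruction)."""
    idx = np.arange(h * w).reshape(h, w)
    rows, cols = np.indices((h, w))
    parity = (rows + cols) % 2
    gen_tokens = idx[parity == 0]   # generation stage: spatially dispersed
    rec_tokens = idx[parity == 1]   # reconstruction stage: every neighbor known
    return gen_tokens, rec_tokens

gen, rec = checkerboard_partition(16, 16)
print(len(gen), len(rec))  # each half covers 128 of the 256 tokens
```

Because every reconstruction-stage token has all four of its grid neighbors in the generation set, the second stage can safely use a much larger parallel ratio.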

To further exploit the observation that detail‑rich tokens demand more computation, the authors propose Frequency‑Weighted Token Selection (FTS). For each token’s latent vector, a 2‑D Fourier transform is applied; tokens with larger high‑frequency energy are identified as “detail tokens.” During sampling, these tokens receive additional diffusion steps or higher mask‑sampling ratios, while low‑frequency tokens are processed quickly. This training‑free weighting scheme directs the limited inference budget toward the most challenging parts of the image.
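A minimal sketch of how per-token high-frequency energy might be scored, assuming a 2-D FFT over the latent grid with a radial frequency cutoff; the paper's exact scoring rule may differ, and `high_freq_energy`, `cutoff`, and the top-64 budget are hypothetical choices for illustration.

```python
import numpy as np

def high_freq_energy(latents, cutoff=0.5):
    """Score each token position by the energy of its high-frequency
    components. latents: (H, W, C) latent map, one vector per token.
    Frequencies beyond `cutoff` of the maximum radius count as 'high'."""
    H, W, _ = latents.shape
    spec = np.fft.fftshift(np.fft.fft2(latents, axes=(0, 1)), axes=(0, 1))
    fy = np.fft.fftshift(np.fft.fftfreq(H))
    fx = np.fft.fftshift(np.fft.fftfreq(W))
    radius = np.hypot(fy[:, None], fx[None, :])          # (H, W) radial freq
    hi_mask = radius > cutoff * radius.max()
    # Keep only high frequencies, invert, and sum energy per token.
    spec_hi = spec * hi_mask[..., None]
    detail = np.fft.ifft2(np.fft.ifftshift(spec_hi, axes=(0, 1)), axes=(0, 1))
    return np.abs(detail).sum(axis=-1)                   # (H, W) detail score

scores = high_freq_energy(np.random.randn(16, 16, 4))
detail_tokens = np.argsort(scores.ravel())[::-1][:64]    # top-64 get extra budget
```

Tokens ranked highest by this score would then be given additional sampling steps, while the rest are processed at the fast default schedule.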

Algorithmically, the full token set is recursively bisected using modulo‑based rules (Algorithm 1), yielding K disjoint subsets {S₁,…,S_K}. The first K − 1 subsets constitute sub‑stages of the generation phase, progressively increasing the spatial coverage of generated tokens, while the final subset is the reconstruction phase. Within each subset S_k, the conditional distribution p(S_k | S_{<k}) is factorized into M_k masked‑autoregressive steps (M_k ≤ |S_k|). This hierarchical decomposition reduces inter‑token dependency in each step, allowing larger parallel ratios without sacrificing the model’s ability to capture spatial correlations.
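The recursive bisection could look roughly like the sketch below. The alternating parity rule and the name `recursive_bisect` are assumptions standing in for the paper's Algorithm 1, whose exact modulo rules may differ; the sketch only shows how K = 2^depth disjoint subsets covering the grid can arise from repeated parity splits.

```python
def recursive_bisect(h, w, depth):
    """Recursively split an h x w token grid into 2**depth disjoint
    subsets via parity rules, alternating the test at each level."""
    subsets = [[(i, j) for i in range(h) for j in range(w)]]
    for d in range(depth):
        split = []
        for s in subsets:
            # Level 0 splits by checkerboard parity, level 1 by row parity, etc.
            key = (lambda t: (t[0] + t[1]) % 2) if d % 2 == 0 else (lambda t: t[0] % 2)
            split.append([t for t in s if key(t) == 0])
            split.append([t for t in s if key(t) == 1])
        subsets = split
    return subsets

parts = recursive_bisect(16, 16, 2)   # K = 4 disjoint subsets of 64 tokens each
```

In GtR's terms, the first K − 1 subsets would be sampled as generation sub-stages and the last subset reconstructed at a high parallel ratio.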

Extensive experiments on ImageNet class‑conditional generation and text‑to‑image synthesis (e.g., COCO captions) demonstrate that GtR combined with FTS achieves an average 3.72× speedup over the baseline MAR‑H model while preserving generation quality (FID = 1.59, IS = 304.4 versus original 1.59, 299.1). Compared with existing acceleration techniques—such as sampling‑schedule adjustments, token pruning, or diffusion‑based shortcuts—GtR consistently offers higher speed gains with negligible quality loss across multiple model scales.

The method requires no changes to the training objective, architecture, or learned parameters; it is purely a sampling‑time modification. Consequently, it can be retrofitted to any existing MAR implementation. The authors also discuss limitations: the fixed checkerboard partition may be suboptimal for highly asymmetric scenes, and the Fourier‑based FTS adds a modest memory overhead for frequency analysis. Future work could explore learned, dynamic token partitioning (e.g., via reinforcement learning) and extend the two‑stage paradigm to other non‑autoregressive generators such as diffusion models or VQ‑GANs.

In summary, the paper presents a conceptually simple yet technically robust approach that mirrors human visual creation—first sketching a global layout, then filling in details—to dramatically accelerate masked autoregressive image generation without compromising fidelity. This contribution opens a practical pathway for deploying high‑quality, high‑resolution generative models in real‑time applications.

