EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. Classifier-free guidance (CFG), the de facto choice in modern systems, achieves this by contrasting conditional and unconditional predictions. Recent work instead contrasts against negative samples produced at inference by a weaker model: strong/weak model pairs, attention-based masking, stochastic block dropping, or perturbations to the self-attention energy landscape. While these strategies refine generation quality, they offer little control over the granularity or difficulty of the negative samples, and the target layers are typically fixed. We propose Exponential Moving Average Guidance (EMAG), a training-free mechanism that modifies attention at inference time in diffusion transformers, paired with a statistics-based, adaptive layer-selection rule. Unlike prior methods, EMAG produces harder, semantically faithful negatives (fine-grained degradations) that surface difficult failure modes and let the denoiser refine subtle artifacts, improving quality and the Human Preference Score (HPS) by +0.46 over CFG. We further demonstrate that EMAG naturally composes with advanced guidance techniques, such as APG and CADS, further improving HPS.


💡 Research Summary

The paper introduces Exponential Moving Average Guidance (EMAG), a training‑free technique that modifies the attention maps of diffusion transformers during inference to generate “hard negative” samples. Traditional classifier‑free guidance (CFG) improves conditional generation by linearly combining conditional and unconditional predictions, but suffers from reduced diversity and over‑saturation at high guidance scales. Recent approaches that contrast a strong model with a weaker one (via reduced capacity, attention masking, stochastic layer dropping, etc.) aim to expose failure modes, yet they lack fine‑grained control over the difficulty and granularity of the negative samples, and often use a fixed layer selection strategy.

EMAG addresses these gaps by replacing, at selected timesteps and layers, the current attention matrix Aₜ with its exponential moving average Eₜ. The EMA is updated as Eₜ = β·Eₜ₋₁ + (1−β)·Aₜ, where β is derived from a half‑life H=50 (β = e^{−ln 2/H} ≈ 0.986). This operation suppresses high‑frequency refinements while preserving global structure, thereby producing subtle, semantically faithful degradations that act as hard negatives. The method includes a statistics‑based adaptive layer‑selection rule: at each timestep the variance and mean of attention across layers are computed, and layers with the highest variance (i.e., those contributing most to fine‑grained updates) are preferentially chosen for EMA replacement. This adaptive scheme ensures that the guidance targets the most “sensitive” parts of the network at each diffusion step, avoiding unnecessary degradation.
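
The two ingredients above (the half-life-derived EMA update and the variance-based layer choice) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names and the exact selection statistic (plain variance over the attention map) are assumptions for clarity.

```python
import numpy as np

def beta_from_half_life(H: float) -> float:
    """Decay factor beta = e^{-ln 2 / H}, so the EMA's memory halves every H steps.
    With H = 50 this gives beta ~= 0.986, as in the paper's setting."""
    return float(np.exp(-np.log(2.0) / H))

def ema_update(E_prev: np.ndarray, A_t: np.ndarray, beta: float) -> np.ndarray:
    """EMAG attention smoother: E_t = beta * E_{t-1} + (1 - beta) * A_t."""
    return beta * E_prev + (1.0 - beta) * A_t

def select_layers_by_variance(attn_per_layer: list, k: int) -> list:
    """Illustrative stand-in for the statistics-based layer-selection rule:
    pick the k layers whose attention maps show the highest variance at this
    timestep, i.e. those presumed to drive the most fine-grained updates."""
    variances = [float(np.var(A)) for A in attn_per_layer]
    return sorted(np.argsort(variances)[-k:].tolist())
```

In an actual sampler, `ema_update` would run once per denoising step for each selected layer, and the stored `E_t` would be swapped in for `A_t` on the negative (perturbed) branch only.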

Algorithm 1 describes the unconditional case, while Algorithm 2 extends it to conditional generation with optional CFG scaling. In both, the original denoiser output zₜ and the EMA‑perturbed output ẑₜ are combined as z̄ₜ = ẑₜ + wₑ·(zₜ − ẑₜ), where wₑ is the EMAG strength. For conditional generation, the combined result is further blended with the standard CFG update using the CFG scale w_cfg, yielding a unified formulation where CFG+EMAG ≡ EMAG.
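
The extrapolation z̄ₜ = ẑₜ + wₑ·(zₜ − ẑₜ) and its composition with CFG can be written out directly. The unconditional combine follows the formula in the text; the conditional blend below (EMAG on the conditional branch, then the usual CFG extrapolation) is one plausible reading of Algorithm 2, not a confirmed reproduction of it.

```python
import numpy as np

def emag_combine(z_t: np.ndarray, z_hat_t: np.ndarray, w_e: float) -> np.ndarray:
    """EMAG guidance: push away from the EMA-perturbed (hard-negative)
    prediction toward the original one. w_e = 1 recovers z_t unchanged;
    w_e > 1 extrapolates past it."""
    return z_hat_t + w_e * (z_t - z_hat_t)

def cfg_emag_step(z_cond, z_uncond, z_cond_hat, w_cfg: float, w_e: float):
    """Illustrative conditional step: apply EMAG to the conditional branch,
    then the standard CFG update against the unconditional prediction.
    (The exact composition used in Algorithm 2 may differ.)"""
    z_bar = emag_combine(z_cond, z_cond_hat, w_e)
    return z_uncond + w_cfg * (z_bar - z_uncond)
```

Note that with w_e = 1 the EMAG term vanishes and the step reduces to plain CFG, which is consistent with the unified CFG+EMAG formulation described above.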

Experiments are conducted on two transformer‑based diffusion backbones: DiT and MMDiT (the latter underlying Stable Diffusion 3). Both class‑conditional and text‑to‑image tasks are evaluated on the COCO‑2014 validation split, using identical sampling steps and the Human Preference Score (HPS) as the primary metric. EMAG alone improves HPS from 29.22 to 29.68 (+0.46). When composed with Adaptive Projected Guidance (APG) or the Condition‑Annealed Diffusion Sampler (CADS), EMAG yields additional gains, demonstrating its complementary nature. Qualitative comparisons (Figures 2‑3) show that prior negative‑sample methods (e.g., SA‑G, SEG, ERG, S²‑Guidance) often produce obvious blurs or noise, whereas EMAG creates nuanced, near‑miss degradations that retain overall semantics while exposing subtle artifacts. This enables the denoiser to correct errors that would otherwise be missed.

The authors discuss limitations: the EMA requires a warm‑up period δₜ before swapping, and the decay factor β must be tuned per dataset/model. The approach is currently limited to transformer‑based diffusion models; extending it to CNN‑based architectures remains an open question. Nevertheless, EMAG provides a simple, plug‑and‑play module that can be added to existing pipelines without retraining, offering precise control over negative‑sample hardness and improving both objective and human‑perceived quality.
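
The plug-and-play nature and the warm-up period δₜ can be illustrated with a small stateful hook that a pipeline could attach to an attention layer. The hook interface, class name, and default warm-up length are all illustrative assumptions; real integrations would wrap the framework's attention processor instead.

```python
import numpy as np

class EMAGAttentionHook:
    """Toy hook sketching EMAG's plug-and-play behavior: it tracks an EMA of
    the incoming attention map and, only after a warm-up of delta_t calls,
    returns the EMA in place of the fresh map (the hard-negative branch).
    Interface and defaults are illustrative, not from the paper."""

    def __init__(self, half_life: float = 50.0, delta_t: int = 5):
        self.beta = float(np.exp(-np.log(2.0) / half_life))
        self.delta_t = delta_t  # warm-up steps before the EMA swap kicks in
        self.step = 0
        self.ema = None

    def __call__(self, attn: np.ndarray) -> np.ndarray:
        # Update the running EMA on every call, even during warm-up.
        if self.ema is None:
            self.ema = attn.copy()
        else:
            self.ema = self.beta * self.ema + (1.0 - self.beta) * attn
        self.step += 1
        # Warm-up: pass the true attention through; afterwards, swap in the EMA.
        return attn if self.step <= self.delta_t else self.ema
```

Because the hook only reads and replaces attention maps at inference, no weights change and no retraining is needed, which is what makes the mechanism drop-in.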

In summary, EMAG advances diffusion sampling by introducing a controllable, training‑free mechanism to generate hard negatives via attention‑space EMA and adaptive layer selection. It boosts human preference metrics, integrates seamlessly with existing guidance strategies, and opens avenues for more fine‑grained, semantics‑aware sampling in generative AI.

