One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Diffusion models for super-resolution (SR) produce high-quality visual results but incur high computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods in perceptual quality at limited computational cost. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, less GPU memory, and lower training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.


💡 Research Summary

The paper addresses the high computational cost of diffusion‑based super‑resolution (SR) models while preserving their superior perceptual quality. Recent diffusion SR approaches such as ResShift achieve impressive results with 15 denoising steps (NFE), but remain significantly slower than GAN‑based methods. Attempts to accelerate them—SinSR’s one‑step knowledge distillation and OSEDiff’s use of large text‑to‑image (T2I) models with variational score distillation (VSD)—either produce blurred outputs or require billions of parameters, excessive GPU memory, and long training times.

To overcome these limitations, the authors propose Residual Shifting Distillation (RSD), a novel one‑step distillation framework built on top of ResShift. The core idea is meta‑distillation: a student generator Gθ maps a low‑resolution (LR) image directly to a high‑resolution (HR) image in a single step, and a “fake” ResShift model fϕ is trained on the synthetic (LR,HR) pairs produced by Gθ. By forcing fϕ to match the teacher ResShift f* on the same data distribution, the student’s output distribution is implicitly aligned with the teacher’s data distribution.
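A minimal, illustrative sketch of this alternating scheme is below. Toy linear maps stand in for the actual networks, and all names (`w_g`, `w_fake`, `w_star`, `forward_diffuse`) and hyper-parameters are hypothetical, not from the paper; the real method trains U-Net-style models over the full ResShift schedule.

```python
import numpy as np

# Toy sketch of the alternating RSD-style updates. Each "network" is a
# linear map on flattened images; everything here is illustrative.
rng = np.random.default_rng(0)
D = 16                                   # toy image dimensionality
w_g = rng.normal(size=(D, D)) * 0.1      # one-step student generator Gθ
w_fake = rng.normal(size=(D, D)) * 0.1   # "fake" ResShift model fϕ
w_star = np.eye(D)                       # frozen teacher f* (identity, as a toy)

def forward_diffuse(x0, y0, eta_t, kappa=1.0):
    """ResShift-style forward sample: shift x0 toward y0 by the residual."""
    e0 = y0 - x0
    eps = rng.normal(size=x0.shape)
    return x0 + eta_t * e0 + kappa * np.sqrt(eta_t) * eps

lr_g = lr_fake = 5e-3
eta_t = 0.5                              # single fixed noise level for the toy
for step in range(200):
    y0 = rng.normal(size=D)              # toy "LR" input
    x_hat = w_g @ y0                     # student's one-step "HR" prediction
    x_t = forward_diffuse(x_hat, y0, eta_t)

    # (1) Fit the fake model on the student's synthetic pair (x_t -> x_hat):
    resid_fake = w_fake @ x_t - x_hat
    w_fake -= lr_fake * np.outer(resid_fake, x_t)  # grad of 0.5||fϕ(x_t) - x_hat||²

    # (2) Update the student so the fake model agrees with the teacher on x_t.
    # The gradient flows through x_hat only; here dx_t/dx_hat = (1 - eta_t) I.
    g_out = (w_fake - w_star) @ x_t                # fϕ(x_t) - f*(x_t)
    grad_g = (1.0 - eta_t) * np.outer((w_fake - w_star).T @ g_out, y0)
    w_g -= lr_g * grad_g
```

The key structural point the toy mirrors: the fake model chases the student's current output distribution, while the student is pushed in the direction that makes the fake model and the frozen teacher agree.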

Mathematically, the authors start from the ResShift forward process, in which the residual e₀ = y₀ − x₀ is gradually added to the HR image x₀ (shifting it toward the LR image y₀) together with time-dependent Gaussian noise. The teacher's reverse process predicts x₀ with a neural network. Directly minimizing the distance between f_{Gθ} (a ResShift model trained on the student's outputs) and the teacher f* is intractable, because back-propagation would have to flow through the entire training run of f_{Gθ}. The paper resolves this by deriving an equivalent, tractable loss (Proposition 3.1) that replaces the gradient through f_{Gθ} with alternating training of the fake model fϕ. The resulting objective can be interpreted as minimizing the KL divergence between the full joint distributions p_θ(x₀:T | y₀) and p*(x₀:T | y₀), a stronger alignment than the pointwise VSD loss used in OSEDiff.
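In ResShift's notation (a sketch following the ResShift formulation: η_t is a monotonically increasing shift schedule with η_T ≈ 1, and κ a noise hyper-parameter), the forward process and the trajectory-level objective described above can be written as:

```latex
% ResShift forward marginal: shift x_0 toward y_0 by the residual e_0.
q(x_t \mid x_0, y_0) = \mathcal{N}\!\bigl(x_t;\; x_0 + \eta_t e_0,\; \kappa^2 \eta_t \mathbf{I}\bigr),
\qquad e_0 = y_0 - x_0,
% equivalently, by reparameterization:
x_t = x_0 + \eta_t\,(y_0 - x_0) + \kappa\sqrt{\eta_t}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \mathbf{I}).
% RSD's objective, as summarized in the text, aligns full trajectories:
\mathcal{L}(\theta) = D_{\mathrm{KL}}\!\bigl(p_\theta(x_{0:T} \mid y_0)\,\big\|\,p^*(x_{0:T} \mid y_0)\bigr).
```

Aligning the joint distribution over the whole trajectory x₀:T, rather than matching scores at individual timesteps, is what distinguishes this objective from VSD.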

RSD also supports multi‑step training: a subset of timesteps {t₁, …, t_N} is selected, and the generator is conditioned on the timestep, allowing it to learn the conditional distributions p_θ(x₀ | x_{t_n}, y₀) ≈ q(x₀ | x_{t_n}, y₀) for every selected t_n. During inference, only the final timestep T is used, preserving one‑step speed while gaining robustness from the multi‑step exposure during training.
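The train/inference asymmetry can be sketched as follows; the subset of timesteps, the schedule, and the stand-in `generator` are all hypothetical toys (a real student is a timestep-conditioned U-Net):

```python
import numpy as np

# Toy sketch of multi-step training exposure: the student is conditioned
# on several timesteps during training, but only t = T is used at inference.
rng = np.random.default_rng(1)
T = 15
timesteps = [3, 7, 15]                    # hypothetical subset {t_1, ..., t_N}
eta = np.linspace(0.05, 0.95, T + 1)      # toy monotone shift schedule η_t

def diffuse(x0, y0, t, kappa=1.0):
    """ResShift-style forward sample x_t from clean x0 and LR y0."""
    eps = rng.normal(size=x0.shape)
    return x0 + eta[t] * (y0 - x0) + kappa * np.sqrt(eta[t]) * eps

def generator(x_t, y0, t):
    """Stand-in for the timestep-conditioned student Gθ(x_t, y0, t);
    this toy just inverts the noiseless shift."""
    return (x_t - eta[t] * y0) / (1.0 - eta[t])

x0 = rng.normal(size=8)                   # toy HR image
y0 = rng.normal(size=8)                   # toy LR image

# Training time: the student sees every t_n in the subset...
for t_n in timesteps:
    x_tn = diffuse(x0, y0, t_n)
    x_pred = generator(x_tn, y0, t_n)     # trained to match x0

# ...inference uses only t = T, i.e. one step from the shifted prior:
x_sr = generator(diffuse(x0, y0, T), y0, T)
```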

The experimental protocol spans synthetic and real‑world benchmarks: DIV2K, ImageNet, RealSR, RealSet65, and DRealSR. Evaluation metrics include no‑reference perceptual scores (CLIPIQA, MUSIQ), the full‑reference perceptual score LPIPS, and fidelity metrics (PSNR, SSIM). RSD‑1 consistently outperforms SinSR‑1 across the perceptual metrics, achieving lower LPIPS (where lower is better) and higher CLIPIQA and MUSIQ scores. Compared to OSEDiff‑1, RSD‑1 attains comparable or better perceptual quality while using far fewer parameters and far less GPU memory. Notably, RSD‑1 even surpasses the original ResShift‑15 teacher on perceptual metrics, demonstrating that distillation can improve over the teacher when the student is guided by the fake‑model alignment.

Ablation studies confirm several design choices: (1) removing the fake model makes training unstable; (2) adding multi‑step conditioning improves the final one‑step PSNR by roughly 0.3 dB; (3) incorporating an LPIPS‑based supervised loss further refines fine‑grained texture details. The authors also compare the RSD loss to the VSD loss analytically, showing that RSD directly minimizes a KL divergence over the entire diffusion trajectory, whereas VSD only aligns scores at a single timestep.

In terms of efficiency, RSD requires a single denoising step (NFE = 1), similar to SinSR, but its parameter count and memory footprint are comparable to SinSR and far lower than OSEDiff’s T2I‑based pipelines. Training time is also reduced because the fake ResShift model is lightweight and trained jointly with the generator.

Overall, the paper makes three major contributions: (I) a theoretically grounded one‑step distillation objective that bridges the gap between knowledge distillation and variational score distillation; (II) a practical implementation that yields a student model outperforming both its teacher and competing one‑step diffusion SR methods in perceptual quality; (III) extensive empirical validation showing that high‑quality, real‑time SR is achievable without resorting to massive T2I models. The work opens avenues for applying the RSD framework to other diffusion‑based restoration tasks and for combining it with text‑conditioned diffusion models for multi‑modal SR.

