Generative Preprocessing for Image Compression with Pre-trained Diffusion Models
Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model’s attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model’s powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.
💡 Research Summary
The paper introduces a novel “Rate‑Perception” (R‑P) optimization framework for image compression preprocessing, moving beyond the conventional pixel‑level “Rate‑Distortion” (R‑D) paradigm. The authors leverage a large‑scale pre‑trained text‑to‑image diffusion model, Stable Diffusion 2.1, and adapt it for image‑to‑image translation in a computationally efficient manner. Their approach consists of two stages.
Stage 1 – Distillation.
Because the original diffusion pipeline requires dozens of denoising steps, it is far too slow for preprocessing. The authors therefore employ Consistent Score Identity Distillation (CiD), a recent knowledge‑distillation technique designed for image‑to‑image tasks. CiD uses a frozen teacher model to provide a stable “score anchor” (the latent representation of the original image) and trains a student model to match the teacher’s score while aligning with this anchor. The loss combines an identity term that enforces consistency with the anchor and a score‑difference term, weighted by a hyper‑parameter ξ. The result is a compact, single‑step U‑Net that retains the generative prior of Stable Diffusion but runs in roughly 1 % of the original inference time. The text encoder is discarded and replaced by a fixed embedding, further reducing overhead.
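The CiD objective described above can be sketched as a simple two-term loss. This is an illustrative form only: the exact functional shape of the identity and score-difference terms in the paper is not reproduced here, and `xi` stands for the hyper-parameter ξ mentioned in the text.

```python
import numpy as np

def cid_loss(student_out, teacher_score, anchor_latent, xi=0.5):
    """Sketch of a CiD-style distillation objective (hypothetical form):
    an identity term pulling the student output toward the frozen score
    anchor (the latent of the original image), plus a score-difference
    term against the teacher, weighted by xi."""
    identity = np.mean((student_out - anchor_latent) ** 2)
    score_diff = np.mean((student_out - teacher_score) ** 2)
    return identity + xi * score_diff
```

In this sketch the anchor term keeps the one-step student consistent with the input image while the score term transfers the teacher's generative prior; ξ balances the two.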
Stage 2 – Rate‑Perception Fine‑tuning.
The distilled model is still a generative network, so it cannot be directly used as a preprocessing filter. The authors fine‑tune only the attention modules of the U‑Net (the query, key, and value projection matrices) while keeping the VAE encoder/decoder frozen. This parameter‑efficient strategy preserves the rich semantic knowledge of the foundation model while adapting it to the compression task.
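The parameter-efficient strategy amounts to unfreezing only the attention projection weights. A minimal sketch of the selection logic, assuming `diffusers`-style parameter names (`to_q`/`to_k`/`to_v`); the actual module naming in the authors' codebase may differ:

```python
def select_trainable(param_names):
    """Return only attention projection parameters (query/key/value)
    for fine-tuning; all other parameters (VAE, resnet blocks, etc.)
    stay frozen. Names follow the diffusers U-Net convention, which is
    an assumption here."""
    attn_keys = ("to_q", "to_k", "to_v")
    return [n for n in param_names if any(k in n for k in attn_keys)]
```

In a PyTorch training loop one would set `requires_grad = False` on every parameter and re-enable it only for the names this filter returns.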
A differentiable surrogate for the BPG codec (diff‑BPG) is introduced to enable end‑to‑end training. The surrogate replaces the non‑differentiable mode‑selection step with a soft‑argmin and approximates the quantization rounding operation using a Fourier‑series expansion, allowing gradients to flow through the entire compression pipeline. An entropy model predicts bits‑per‑pixel (bpp) for the rate term.
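The two relaxations in diff-BPG can be sketched as follows. The rounding surrogate uses the standard Fourier series of the sawtooth x − round(x); the term count, temperature, and exact form used in the paper are assumptions here, not the authors' implementation.

```python
import numpy as np

def soft_round(x, n_terms=10):
    """Differentiable surrogate for quantization rounding: subtract a
    truncated Fourier series of the sawtooth x - round(x). Smooth in x,
    so gradients flow through the quantizer."""
    x = np.asarray(x, dtype=float)
    saw = sum((-1) ** (k + 1) * np.sin(2 * np.pi * k * x) / (np.pi * k)
              for k in range(1, n_terms + 1))
    return x - saw

def soft_argmin(costs, tau=0.1):
    """Differentiable mode selection: softmax over negative candidate
    costs, yielding soft weights instead of a hard argmin. tau controls
    how sharply the cheapest mode dominates."""
    c = np.asarray(costs, dtype=float)
    w = np.exp(-c / tau)
    return w / w.sum()
```

As tau → 0 the soft-argmin weights approach a one-hot selection of the cheapest mode, recovering the codec's hard decision at inference time.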
The overall loss is a weighted sum of three components: an L1 pixel‑wise term (preserving fine details), a perceptual term based on DISTS (capturing human‑aligned structural similarity), and a rate term derived from the diff‑BPG’s bpp estimate. The weight λ for the rate term is dynamically adjusted as a function of the quantization parameter (QP) via an exponential schedule λ(QP)=exp(w₁·QP+w₂), enabling the network to learn appropriate trade‑offs across a wide bitrate range.
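The schedule and the composite loss above are straightforward to write down. The constants `w1` and `w2` below are illustrative placeholders, as are the unit weights on the pixel and perceptual terms; the paper's fitted values are not given here.

```python
import math

def rate_weight(qp, w1=0.08, w2=-4.0):
    """Exponential rate-weight schedule lambda(QP) = exp(w1*QP + w2).
    Higher QP (coarser quantization) increases the rate penalty."""
    return math.exp(w1 * qp + w2)

def total_loss(l1_term, dists_term, bpp, qp):
    """Weighted sum of the three loss components: L1 pixel fidelity,
    DISTS perceptual similarity, and the diff-BPG rate estimate."""
    return l1_term + dists_term + rate_weight(qp) * bpp
```

Because λ(QP) is a smooth function of QP, a single network can be trained across the whole bitrate range rather than one model per operating point.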
Experiments.
Distillation is performed on a mixture of LSDIR and FFHQ, while fine‑tuning uses DIV2K and Flickr2K patches filtered by a Just‑Noticeable‑Distortion (JND) score > 0.8 to focus on texture‑rich regions. Training runs on eight NVIDIA V100 GPUs with Adam (initial LR = 1e‑3, cosine annealing to 1e‑8). Evaluation uses the Kodak and CLIC‑Professional validation datasets, applying three standard codecs (JPEG, WebP, BPG). Perceptual quality is measured with LPIPS, DISTS, and TOPIQ‑fr, and bitrate savings are reported as Bjøntegaard Delta Bit‑Rate (BDBR).
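The BDBR numbers reported below follow the standard Bjøntegaard procedure: fit log-rate as a cubic polynomial of quality, integrate the gap between the two fitted curves over the overlapping quality range, and convert the mean log-rate difference to a percentage. A compact sketch (any monotone quality metric can stand in for DISTS):

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Bjontegaard Delta Bit-Rate (%). Negative values mean the test
    codec needs fewer bits than the anchor at equal quality."""
    lr_a = np.log(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log(np.asarray(rate_test, dtype=float))
    p_a = np.polyfit(q_anchor, lr_a, 3)          # cubic fit: quality -> log-rate
    p_t = np.polyfit(q_test, lr_t, 3)
    lo = max(min(q_anchor), min(q_test))          # overlapping quality interval
    hi = min(max(q_anchor), max(q_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)    # mean log-rate gap
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

For example, a test curve whose rates are exactly half the anchor's at every quality point yields a BDBR of −50 %.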
Results show substantial gains: on Kodak, the method achieves a DISTS BDBR of –30.13 % when paired with BPG, outperforming the recent TDP preprocessing baseline by 20–30 % across all metrics. Similar improvements hold on CLIC (–27.68 % DISTS BDBR). Rate‑Perception curves show that the proposed method dominates the anchor and TDP baselines in the low‑to‑mid bitrate regime (bpp < 0.5), delivering higher perceptual quality at lower bitrates. At higher bitrates, the generative nature of the model sometimes produces larger deviations from the original image, causing a slight drop in similarity‑based metrics, a known trade‑off of generative preprocessing.
Significance and Limitations.
The work is the first to integrate a large‑scale diffusion foundation model into image compression preprocessing, introducing a principled R‑P optimization that leverages generative priors for texture synthesis and artifact mitigation. By distilling the model to a single step and fine‑tuning only attention projections, the authors achieve a practical balance between performance and computational cost. However, the approach still relies on a relatively large diffusion backbone, and real‑time deployment would require further model compression or hardware‑specific optimizations. Moreover, the generative alterations that benefit low‑bitrate perceptual quality may be undesirable in applications demanding strict fidelity at higher bitrates.
Future Directions.
Potential extensions include: (1) applying the framework to newer, larger diffusion models (e.g., Stable Diffusion XL) and exploring more aggressive pruning or quantization; (2) joint optimization with a broader set of codecs, including modern formats like AVIF or JPEG‑XL; (3) hardware‑aware implementations (ONNX, TensorRT) for edge or mobile deployment; and (4) adaptive R‑P control using meta‑learning or reinforcement learning to automatically select the optimal λ‑schedule based on content characteristics.
In summary, the paper presents a compelling and technically sound pipeline that repurposes pre‑trained diffusion models for perceptually‑oriented compression preprocessing, achieving notable bitrate reductions and visual quality improvements without modifying existing codecs. It opens a promising research avenue at the intersection of generative modeling and image compression.