Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers
Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks such as DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
💡 Research Summary
The paper tackles the prohibitive computational and deployment costs of large‑scale Diffusion Transformer (DiT) models for text‑to‑image (T2I) generation. Starting from Qwen‑Image, a 60‑layer, 20 B‑parameter dual‑stream MMDiT backbone, the authors propose a two‑stage compression pipeline that yields two lightweight models: Amber‑Image‑10B and Amber‑Image‑6B.
Stage 1 – Depth Pruning to Amber‑Image‑10B
The authors introduce a global ablation‑based layer‑importance metric that evaluates each transformer block across a representative prompt set and a subset of diffusion timesteps. By weighting the prediction discrepancy (δ) with a timestep‑dependent factor (ωₜ), the metric captures the fact that errors at early (high‑noise) steps degrade semantic structure more than later steps. The 30 least important layers are pruned, halving the depth. To avoid a cold start, the retained layers are re‑initialized via Local Weight Averaging (LWA): each kept layer’s weights become the arithmetic mean of its own original weights and those of the immediately pruned neighboring layers. This simple warm‑start mitigates the sharp fidelity drop typical of naïve pruning.
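The two ingredients of this stage can be sketched compactly. The following is a minimal illustration, not the paper's implementation: it assumes the ablation discrepancies δ have already been measured per layer and timestep, that each layer's parameters can be treated as a single array, and that each kept layer absorbs the pruned layers immediately following it (the paper only says "immediately pruned neighboring layers", so this grouping is an assumption).

```python
import numpy as np

def layer_importance(delta, omega):
    """Global ablation-based importance score per layer.

    delta: array [L, T] -- prediction discrepancy observed when ablating
           layer l at diffusion timestep t (averaged over a prompt set).
    omega: array [T]    -- timestep weights, larger at early high-noise
           steps, where errors degrade semantic structure most.
    Returns an array [L]; the lowest-scoring layers are pruning candidates.
    """
    return delta @ omega  # weighted sum of discrepancies over timesteps

def local_weight_average(all_weights, kept, pruned):
    """Local Weight Averaging (LWA) warm start for retained layers.

    Each kept layer's new weights are the arithmetic mean of its own
    original weights and those of the adjacent pruned cluster (assumed
    here to be the pruned layers immediately following it).
    all_weights: {layer_index: weight array}.
    """
    pruned_set = set(pruned)
    new_weights = {}
    for k in kept:
        group = [all_weights[k]]
        j = k + 1
        while j in pruned_set:  # absorb the contiguous pruned neighbors
            group.append(all_weights[j])
            j += 1
        new_weights[k] = np.mean(group, axis=0)
    return new_weights
```

In a real MMDiT block the averaging would be applied parameter-by-parameter (attention and MLP matrices separately), but the per-cluster mean is the same operation.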
Recovery proceeds in two phases. First, only the LWA‑initialized layers are trained using layer‑wise knowledge distillation from the original 60‑layer teacher; the rest are frozen. The distillation target for a re‑initialized layer is the hidden state of the deepest teacher layer within the pruned cluster, encouraging the student to implicitly reconstruct the cumulative transformation of the removed blocks. Second, all parameters are unfrozen and a short global fine‑tuning with the standard diffusion loss aligns the whole network. The result is Amber‑Image‑10B, a 10 B‑parameter model that retains most of the visual fidelity and semantic consistency of its massive predecessor.
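The first recovery phase can be sketched as a target-mapping plus a hidden-state matching loss. This is a hypothetical illustration assuming the same cluster convention as above (each kept layer replaces itself plus the pruned layers immediately following it) and hidden states stored as plain arrays:

```python
import numpy as np

def distill_targets(kept, pruned):
    """Map each re-initialized student layer to the deepest teacher layer
    within the pruned cluster it replaces, so the student learns the
    cumulative transformation of the removed blocks."""
    pruned_set = set(pruned)
    mapping = {}
    for k in kept:
        j = k
        while j + 1 in pruned_set:  # walk to the end of the pruned cluster
            j += 1
        mapping[k] = j
    return mapping

def layerwise_distill_loss(student_h, teacher_h, mapping):
    """Mean-squared error between each student layer's hidden state and
    the teacher hidden state at its mapped (deepest-in-cluster) layer.
    student_h / teacher_h: {layer_index: hidden-state array}."""
    return float(np.mean([np.mean((student_h[s] - teacher_h[t]) ** 2)
                          for s, t in mapping.items()]))
```

During this phase only the LWA-initialized layers would receive gradients; the subsequent global fine-tuning then trains all parameters with the standard diffusion loss.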
Stage 2 – Hybrid‑Stream Conversion to Amber‑Image‑6B
Building on the 10B model, the authors further compress by converting the deeper half of the backbone from a dual‑stream (separate image and text streams) to a single‑stream architecture. The first 10 layers remain dual‑stream to preserve modality‑specific early processing, while layers 11‑30 are merged into a shared stream initialized directly from the image‑branch weights of the 10B teacher. This hybrid design exploits the high cross‑modal redundancy observed in deeper layers, cutting an additional ~40 % of parameters.
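The conversion itself is essentially a weight-selection step. Below is a minimal sketch, assuming a hypothetical layout in which each dual-stream block stores its branch parameters under `"img"` and `"txt"` keys; only the image-branch weights survive to initialize the new shared blocks:

```python
def to_hybrid_stream(dual_layers, n_dual=10):
    """Convert deep dual-stream blocks into single-stream blocks.

    dual_layers: list of dicts, each with (assumed) keys "img" and "txt"
    holding that block's per-branch parameters. The first n_dual blocks
    stay dual-stream to preserve modality-specific early processing;
    every deeper block keeps only its image-branch weights, which
    initialize the new shared single-stream block.
    """
    kept_dual = dual_layers[:n_dual]
    single = [dict(layer["img"]) for layer in dual_layers[n_dual:]]  # text branch dropped
    return kept_dual, single
```

For the 30-layer Amber-Image-10B backbone this keeps layers 1-10 dual-stream and merges layers 11-30, matching the split described above.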
A two‑step recovery follows. First, a local distillation phase trains the new single‑stream layers to match concatenated hidden states (image + text) from the teacher, while keeping the early dual‑stream layers frozen as semantic anchors. Then a lightweight full‑parameter fine‑tuning refines the entire model.
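The local distillation target for a merged layer can be sketched as follows; this assumes the teacher's image and text hidden states are concatenated along the token axis (image tokens first is an assumption) and that a simple MSE is used as the matching loss:

```python
import numpy as np

def hybrid_distill_loss(student_h, teacher_img_h, teacher_txt_h):
    """MSE between the single-stream student's hidden state and the
    teacher's concatenated dual-stream hidden states.

    student_h:     [B, N_img + N_txt, D]  -- shared-stream tokens
    teacher_img_h: [B, N_img, D]
    teacher_txt_h: [B, N_txt, D]
    """
    target = np.concatenate([teacher_img_h, teacher_txt_h], axis=1)
    return float(np.mean((student_h - target) ** 2))
```

While this loss trains the new single-stream layers, the early dual-stream layers stay frozen as semantic anchors; the lightweight full-parameter fine-tuning then unfreezes everything.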
Efficiency and Performance
The complete pipeline—from the original Qwen‑Image to Amber‑Image‑6B—requires fewer than 2,000 GPU‑hours on eight NVIDIA A100 GPUs (≈10 days), a fraction of the cost of training comparable models from scratch. Empirical evaluation on benchmarks such as DPG‑Bench and LongText‑Bench shows that Amber‑Image‑6B achieves image quality (FID, IS) and text rendering (CLIPScore) on par with, and sometimes surpassing, proprietary or open‑source models with 20‑30 B parameters. Notably, the method does not rely on massive curated datasets; a small high‑quality dataset (on the order of thousands of images) suffices for the fine‑tuning stages.
Key Contributions
- A timestep‑sensitive global ablation metric for reliable layer importance estimation.
- Local weight averaging as a simple yet effective warm‑start for pruned layers.
- A two‑stage recovery protocol (layer‑wise distillation + global fine‑tuning) that efficiently restores capacity after aggressive pruning.
- A hybrid‑stream architectural transformation that leverages cross‑modal redundancy to further reduce parameters without sacrificing performance.
Overall, Amber‑Image demonstrates a practical pathway to compress large diffusion transformers, dramatically lowering computational and data requirements while preserving high‑fidelity image synthesis and accurate text rendering. This work sets a new benchmark for efficient, deployable T2I models.