StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs


Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of 1.35× (up to 1.77×).


💡 Research Summary

The paper addresses the growing need for efficient distributed inference of Diffusion Transformers (DiTs), which are increasingly used for high‑quality image and video generation. As resolution and video length increase, the sequence length in attention layers grows, causing activation tensors to become too large for a single GPU’s memory and leading to high latency. Existing sequence‑parallel (SP) approaches—Ring Attention, Ulysses Attention, and Unified Sequence Parallelism (USP)—mitigate this by sharding the sequence dimension across multiple GPUs, but they suffer from three fundamental drawbacks: (1) they ignore the stark bandwidth disparity between intra‑machine (NVSwitch/NVLink) and inter‑machine (Ethernet, InfiniBand) networks; (2) Ulysses Attention’s all‑to‑all operations cannot overlap with computation, creating a communication bottleneck; and (3) two‑sided communication libraries such as NCCL impose implicit synchronization barriers that add unnecessary latency.

StreamFusion is introduced as a topology‑aware, high‑performance SP engine that resolves these issues through three innovations. First, a topology‑aware communication scheduler deliberately assigns Ring Attention to intra‑machine communication (where low‑latency, high‑bandwidth links are abundant) and Ulysses Attention to inter‑machine communication (where the all‑to‑all pattern can exploit the higher aggregate bandwidth of the network fabric). This reverses USP's design, which places Ring Attention across machines and Ulysses Attention within a machine; because Ring Attention's per‑step communication volume does not shrink as more GPUs are added, moving it onto the fast intra‑machine links substantially reduces inter‑machine traffic.
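The placement above amounts to partitioning the global ranks into two families of process groups: Ring Attention groups spanning the GPUs of each machine, and Ulysses Attention groups containing one GPU per machine at the same local index. The sketch below is illustrative only (it is not StreamFusion's actual API) and assumes ranks are assigned to machines contiguously:

```python
def build_sp_groups(world_size, gpus_per_machine):
    """Partition global ranks into intra-machine Ring Attention groups
    and inter-machine Ulysses Attention groups (illustrative sketch).

    Assumes ranks 0..gpus_per_machine-1 live on machine 0, and so on.
    """
    assert world_size % gpus_per_machine == 0
    num_machines = world_size // gpus_per_machine

    # Ring groups: all GPUs on the same machine (fast NVLink/NVSwitch links).
    ring_groups = [
        list(range(m * gpus_per_machine, (m + 1) * gpus_per_machine))
        for m in range(num_machines)
    ]

    # Ulysses groups: one GPU per machine at the same local index, so the
    # all-to-all crosses the inter-machine fabric.
    ulysses_groups = [
        [m * gpus_per_machine + local for m in range(num_machines)]
        for local in range(gpus_per_machine)
    ]
    return ring_groups, ulysses_groups
```

For example, `build_sp_groups(8, 4)` yields ring groups `[[0, 1, 2, 3], [4, 5, 6, 7]]` and Ulysses groups `[[0, 4], [1, 5], [2, 6], [3, 7]]`; in a real deployment each list would back a communicator (e.g. an NCCL or NVSHMEM team).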

Second, the authors propose Torus Attention, a novel algorithm that partitions the all‑to‑all steps of Ulysses Attention into multiple chunks. By recognizing that the tensors before and after each all‑to‑all are “stationary,” they pipeline the communication of chunk k+1 while simultaneously computing attention on chunk k. This overlap eliminates the idle waiting periods that previously dominated the critical path, cutting inter‑machine all‑to‑all latency by roughly 45 % in their measurements.
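The chunked overlap can be sketched as a simple software pipeline: while attention runs on chunk k, the all-to-all for chunk k+1 is already in flight. The sketch below is a schematic stand-in, not the paper's implementation: a one-worker thread pool plays the role of an asynchronous communication stream, and `all_to_all` and `compute_attention` are placeholder callables.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_attention(chunks, all_to_all, compute_attention):
    """Overlap communication of chunk k+1 with computation on chunk k."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Prime the pipeline: start communicating the first chunk.
        in_flight = comm.submit(all_to_all, chunks[0])
        for k in range(len(chunks)):
            ready = in_flight.result()  # wait for chunk k's data to arrive
            if k + 1 < len(chunks):
                # Launch chunk k+1's all-to-all before computing chunk k,
                # so communication and computation proceed concurrently.
                in_flight = comm.submit(all_to_all, chunks[k + 1])
            outputs.append(compute_attention(ready))
    return outputs
```

With more than two chunks, only the first all-to-all sits on the critical path; every subsequent transfer is hidden behind the previous chunk's attention computation, which is the source of the reported latency reduction.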

Third, StreamFusion replaces NCCL’s two‑sided send/receive primitives with one‑sided operations provided by NVSHMEM. Push‑pull semantics allow the programmer to control when synchronization occurs, removing the implicit barriers that forced each GPU to wait for its peer’s data to be ready. The result is a leaner communication pipeline with lower synchronization overhead and better utilization of GPU compute resources.
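NVSHMEM itself is a GPU-side C/CUDA library, so the pure-Python sketch below only models the semantic difference rather than the real API: with a one-sided put, the sender writes directly into the target's symmetric buffer and raises a flag, and the receiver synchronizes only when it actually needs the data, instead of both sides blocking in a matched send/recv. All names here are hypothetical.

```python
import threading

class OneSidedWindow:
    """Toy model of a one-sided put into a symmetric buffer, loosely
    inspired by NVSHMEM semantics (not its real API)."""

    def __init__(self, size):
        self.buf = [None] * size
        self.flag = threading.Event()  # readiness signal, set by the sender

    def put(self, offset, data):
        # Sender writes directly into the target buffer; no matching
        # receive call is required on the other side.
        self.buf[offset:offset + len(data)] = data
        self.flag.set()  # explicit, programmer-controlled synchronization

    def wait_and_read(self):
        # Receiver blocks only here, at the point it chooses, rather than
        # inside every transfer as with two-sided send/recv.
        self.flag.wait()
        return list(self.buf)
```

The design point is that synchronization becomes an explicit, movable operation: the receiver can keep computing until `wait_and_read` is called, which is what removes the implicit barriers of two-sided libraries.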

The authors evaluate StreamFusion on several state‑of‑the‑art DiT models, including Stable Diffusion 2.1, CogVideoX, and other recent video generation transformers. Experiments span configurations from two to four GPU machines, each containing four to eight GPUs, and cover both 100 Gbps Ethernet and InfiniBand RDMA interconnects. Across all settings, StreamFusion achieves an average end‑to‑end inference speedup of 1.35× over USP, with a peak improvement of 1.77×. Memory consumption remains comparable to USP, allowing generation of a 10‑second 768 × 1360 video (which would exceed a single A100’s 40 GiB memory) on a two‑machine setup without out‑of‑memory failures.

In summary, StreamFusion contributes (1) a topology‑aware scheduling strategy that aligns communication patterns with the physical bandwidth hierarchy of modern GPU clusters, (2) Torus Attention, which enables fine‑grained overlap of all‑to‑all communication and attention computation, and (3) a one‑sided communication implementation that eliminates unnecessary synchronization. The work demonstrates that careful co‑design of communication algorithms and hardware topology can substantially improve the scalability of diffusion transformer inference, paving the way for real‑time, high‑resolution generative AI services on multi‑node GPU farms. Future directions include scaling to larger clusters (e.g., 64‑GPU configurations), exploring NVLink‑based multi‑node topologies, and extending the approach to other transformer‑based generative models such as text‑to‑image pipelines.

