Deep Kernel Fusion for Transformers
Notice: This research summary and analysis were generated automatically with AI. For authoritative details, please refer to the original arXiv source.

Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel delivers consistent acceleration across generation lengths while remaining adaptable to diverse models, inference configurations, and hardware platforms.


💡 Research Summary

The paper addresses a growing performance bottleneck in agentic large‑language‑model (LLM) inference: memory bandwidth, rather than raw compute, limits throughput when processing very long contexts. In modern Transformer‑based LLMs the traditional two‑layer ReLU MLP has been replaced by a SwiGLU design, which consists of two large linear projections (W_up and W_gate) followed by a SiLU‑based gating operation and a final projection (W_down). Because the intermediate dimension of the first two projections is 3.5–4× the model dimension, their weight matrices are large, and each token generation requires loading massive amounts of data from high‑bandwidth memory (HBM). Existing optimizations focus mainly on the attention path; the MLP path remains under‑optimized despite accounting for a substantial fraction of per‑token latency, especially in autoregressive decoding with small batch sizes and long KV caches.

The authors propose DeepFusionKernel, a deeply fused CUDA kernel that collapses the separate GEMM and point‑wise kernels used in typical SwiGLU implementations (four kernels in a naïve PyTorch flow, two kernels in SGLang/vLLM) into a single operator. The key idea is to stream intermediate results directly through the computation: while performing the first GEMM (X·W_up) the kernel simultaneously computes the second GEMM (X·W_gate), applies the SiLU activation, multiplies the two streams element‑wise, and writes only the final intermediate tensor needed for the downstream W_down projection. By avoiding materialisation of large temporaries in HBM, the kernel dramatically reduces read/write traffic.
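The data flow being fused can be written out as a reference implementation. The sketch below is a NumPy stand‑in for the actual CUDA kernel (shapes and variable names are illustrative, not the paper's code); its comments mark the temporaries that deep fusion keeps on‑chip:

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_unfused(x, w_up, w_gate, w_down):
    """Reference SwiGLU MLP as separate operations.

    `up` and `gate` are (batch, d_ff) temporaries that an unfused
    implementation writes to and re-reads from HBM; the fused kernel
    keeps them in registers/shared memory and materializes only the
    input to the W_down projection.
    """
    up = x @ w_up            # GEMM 1
    gate = silu(x @ w_gate)  # GEMM 2 + point-wise activation
    gated = up * gate        # element-wise product
    return gated @ w_down    # final (downstream) projection

# Toy shapes for illustration only: d_model=8, d_ff=32.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
w_up = rng.standard_normal((8, 32))
w_gate = rng.standard_normal((8, 32))
w_down = rng.standard_normal((32, 8))
print(swiglu_unfused(x, w_up, w_gate, w_down).shape)  # → (2, 8)
```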

To achieve high performance across diverse workloads, the authors explore two tiling strategies:

  1. Row‑major tiling – tiles the input activation matrix X by rows, improving reuse of X when batch sizes are larger or when activations dominate traffic.
  2. Column‑major tiling – tiles the weight matrices, maximizing reuse of weight tiles when model parameters dominate (common for agentic decoding with batch = 1‑4).
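The reuse pattern behind the two strategies can be illustrated with a toy tiled matmul. This is a sketch under stated assumptions (NumPy loops standing in for shared‑memory tiles; real kernels tile both dimensions and the reduction axis), but the data‑reuse contrast is the same:

```python
import numpy as np

def matmul_row_tiled(x, w, tile=4):
    # Row-major flavor: iterate over row blocks of the activations X.
    # Each X tile is loaded once and reused against all of W -- this
    # favors activation reuse when batch sizes are larger.
    m, _ = x.shape
    out = np.empty((m, w.shape[1]))
    for i in range(0, m, tile):
        x_tile = x[i:i + tile]        # stays "hot" across the full W sweep
        out[i:i + tile] = x_tile @ w
    return out

def matmul_col_tiled(x, w, tile=4):
    # Column-major flavor: iterate over column blocks of the weights W.
    # Each weight tile is loaded once and reused by every row of X --
    # this favors weight reuse, which dominates at batch sizes of 1-4.
    _, n = w.shape
    out = np.empty((x.shape[0], n))
    for j in range(0, n, tile):
        w_tile = w[:, j:j + tile]     # loaded once, reused by all rows
        out[:, j:j + tile] = x @ w_tile
    return out
```

Both orders compute the same product; they differ only in which operand's tiles stay resident while the other is streamed.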

Tile sizes are tuned to balance register pressure, shared‑memory occupancy, and Tensor‑Core utilization. Because the optimal configuration depends on model shape, batch size, GPU micro‑architecture, and inter‑connect topology, a lightweight profiler‑driven scheduler benchmarks a small set of candidate kernels at deployment time and selects the highest‑throughput variant. The profiling step takes only a few milliseconds and is performed once before inference; thereafter CUDA Graph capture eliminates any runtime dispatch overhead.
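A profiler‑driven selection of this kind can be sketched in a few lines. The callables and names below are hypothetical stand‑ins, not the paper's API; the actual scheduler additionally captures the winning kernel into a CUDA Graph so that no dispatch logic runs per token:

```python
import time

def pick_fastest(candidates, args=(), warmup=3, iters=10):
    """Sketch of a profiler-driven scheduler: benchmark each candidate
    kernel variant once at deployment time and keep the fastest."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        for _ in range(warmup):        # warm caches before timing
            fn(*args)
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        elapsed = (time.perf_counter() - t0) / iters
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

# Hypothetical stand-ins for row- vs column-tiled kernel variants:
variants = {
    "row_tiled": lambda: sum(range(100)),     # fast dummy workload
    "col_tiled": lambda: time.sleep(0.002),   # slow dummy workload
}
print(pick_fastest(variants))  # → row_tiled
```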

Experimental methodology: The authors integrate DeepFusionKernel into the SGLang inference framework, enable FlashInfer and CUDA Graphs, and compare against three baselines: naïve distributed PyTorch, default SGLang kernels, and vLLM. Experiments run on 4‑GPU tensor‑parallel clusters (TP = 4) of NVIDIA A100 80 GB and H100 80 GB. The primary model is LLaMA‑3.1‑70B in FP16, with a fixed prompt length of 1 token and output lengths ranging from 1 024 to 16 384 tokens. Batch sizes vary from 1 to 64. Each configuration is measured four times; mean throughput and standard deviation are reported.

Results:

  • On A100, DeepFusionKernel yields up to 9.7 % higher decoding throughput than SGLang; on H100 the gain reaches 13.2 %. The benefit is largest for small batch sizes where the workload is strongly memory‑bandwidth bound.
  • As batch size grows, the advantage diminishes on A100 (compute becomes saturated) but remains noticeable on H100, whose peak FP16 Tensor‑Core throughput is far higher (≈ 1.98 PFLOPS vs. ≈ 0.31 PFLOPS), so memory bandwidth still limits performance there.
  • In long‑generation tests (output lengths up to 16 384 tokens) the MLP continues to consume a significant portion of per‑token latency despite the growing attention cost. DeepFusionKernel consistently outperforms both SGLang and vLLM across all lengths and concurrency levels (batch = 1, 4, 16), with speedup variability mainly due to inter‑GPU communication jitter.
  • The kernel scheduler adds negligible overhead; after the initial profiling step, the fused kernel runs under CUDA Graph capture, eliminating recurring dispatch costs.
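A back‑of‑envelope estimate clarifies why the small‑batch regime is the most bandwidth‑bound. The shapes below are LLaMA‑3.1‑70B's published model dimensions (d_model = 8192, d_ff = 28672), not figures quoted in the paper, and the temporary count assumes the unfused flow described above:

```python
# Per-layer SwiGLU traffic in FP16, using LLaMA-3.1-70B's published
# dimensions (an assumption for illustration, not the paper's numbers).
d_model, d_ff, elem_bytes = 8192, 28672, 2

# Weight traffic per decode step: W_up, W_gate, W_down must be streamed
# from HBM regardless of fusion, and dominate at small batch sizes.
weight_bytes = 3 * d_model * d_ff * elem_bytes

def temp_traffic_bytes(batch):
    # The unfused flow materializes the `up` and `gate` activations in
    # HBM (written by the GEMMs, re-read by the gating kernel); fusion
    # keeps both on-chip. Each temporary costs one write plus one read.
    n_temporaries, write_plus_read = 2, 2
    return batch * d_ff * elem_bytes * n_temporaries * write_plus_read

for b in (1, 16, 64):
    print(f"batch {b:3d}: ~{temp_traffic_bytes(b) / 2**20:.2f} MiB of "
          f"temporary traffic avoided per layer "
          f"(weights alone: {weight_bytes / 2**30:.2f} GiB)")
```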

Related work: Prior fusion efforts (Apex, TensorRT‑LLM, DeepSpeed‑MII) perform shallow fusion (e.g., GEMM + activation) but leave large MLP buffers untouched. Compile‑time fusion frameworks such as WELDER, TVM, and Blockbuster can fuse linear chains but either lack runtime feedback or do not target the full SwiGLU tree. DeepFusionKernel distinguishes itself by deeply fusing the entire SwiGLU block and coupling with a runtime‑aware scheduler.

Limitations: The study does not exhaustively evaluate different inter‑connects (NVLink vs. PCIe) or quantify the exact impact of inter‑GPU communication on overall latency. Softmax and other long‑range dependent kernels remain outside the fusion scope. The authors note these as future work.

Conclusion: DeepFusionKernel provides a practical, deployable optimization for memory‑bandwidth‑bound LLM inference. By eliminating intermediate HBM traffic in the SwiGLU MLP, it allows modern GPUs—especially the high‑throughput H100—to approach their theoretical compute limits. The combination of aggressive kernel fusion and a lightweight profiling scheduler yields consistent throughput improvements (up to 13.2 % on H100, 9.7 % on A100) across a range of batch sizes and long‑generation agentic workloads, making it a valuable addition to existing inference stacks such as SGLang and vLLM.

