Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Reading time: 5 minutes
...

📝 Original Info

  • Title: Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
  • ArXiv ID: 2512.01278
  • Date: 2025-12-01
  • Authors: Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica

📝 Abstract

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as both the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens by elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions with up to a 2.13x throughput speedup.
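
To make the self-speculation idea concrete, the sketch below shows the draft-then-verify loop the abstract describes: the same weights serve both roles, and only the attention pattern changes between phases. It is a minimal illustration under assumed interfaces: `draft_step` (a sparse-attention decode) and `verify` (a single full-attention pass) are hypothetical methods, and greedy prefix acceptance stands in for the paper's actual scheduling and verification machinery.

```python
# Minimal sketch of self-speculative decoding (illustrative, not SparseSpec's
# actual implementation). One model plays both roles: it drafts candidate
# tokens cheaply with sparse attention, then verifies them in a single
# full-attention pass, so the large KV-Cache is read once per k drafted tokens.
def self_speculative_decode(model, prompt_ids, max_new_tokens, k=4):
    tokens = list(prompt_ids)
    n_prompt = len(tokens)
    while len(tokens) - n_prompt < max_new_tokens:
        # 1) Draft k tokens with the SAME weights but a sparse attention pattern.
        draft = []
        for _ in range(k):
            draft.append(model.draft_step(tokens + draft))  # hypothetical sparse decode

        # 2) Verify: one full-attention pass scores every drafted position and
        #    returns k+1 greedy target predictions (one bonus token at the end).
        target = model.verify(tokens, draft)                # hypothetical API

        # 3) Accept the longest matching prefix, then take one corrected or
        #    bonus token from the verification pass.
        n_ok = 0
        while n_ok < k and draft[n_ok] == target[n_ok]:
            n_ok += 1
        tokens += draft[:n_ok]
        tokens.append(target[n_ok])
    return tokens
```

Every accepted token beyond the first amortizes one full read of the KV-Cache, which is where the throughput gain comes from.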

💡 Deep Analysis

Figure 1: Comparison between autoregressive generation, draft-model-based speculative decoding, and SparseSpec's self-speculation with dynamic sparse attention.

📄 Full Content

ACCELERATING LARGE-SCALE REASONING MODEL INFERENCE: SELF-SPECULATIVE DECODING WITH SPARSE ATTENTION

Yilong Zhao 1*, Jiaming Tang 2*, Kan Zhu 3, Zihao Ye 4, Chi-Chih Chang 5, Chaofan Lin 6, Jongseok Park 1, Guangxuan Xiao 2, Mohamed S. Abdelfattah 5, Mingyu Gao 6, Baris Kasikci 3, Song Han 2,4, Ion Stoica 1

*Equal contribution by coin flip. 1University of California, Berkeley; 2Massachusetts Institute of Technology; 3University of Washington; 4NVIDIA; 5Cornell University; 6Tsinghua University. Preprint, under review.

ABSTRACT

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as both the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens by elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system optimizations: (1) a unified scheduler that batches both draft and verification phases to maximize parallelism, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management that enables host-memory offload to maximize GPU memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with up to a 2.13× throughput gain. Code is open-sourced at github.com/sspec-project/SparseSpec.

1 INTRODUCTION

Recent advances in reasoning language models (RLMs), such as OpenAI-o1 (OpenAI, 2024), have demonstrated remarkable capabilities in solving complex reasoning tasks. These models typically generate tens of thousands of tokens from problems described in only hundreds of tokens through extensive chain-of-thought (CoT) reasoning (Snell et al., 2024; DeepSeek-AI, 2025a). This lengthy, deliberate reasoning paradigm shifts the performance bottleneck of inference from compute-bound to memory-bound (Zhao et al., 2024b). Due to the auto-regressive nature of RLMs, generating each token requires loading all previously generated key-value vectors (the KV-Cache), making long-output tasks memory-bound. In fact, the total amount of KV-Cache that must be loaded grows quadratically with output length (Tang et al., 2024). For example, when serving Qwen3-8B (Qwen, 2025) on an H100 with a batch size of 128 and an output length of 8192, loading the KV-Cache takes on average 21 ms per step, accounting for over 70% of end-to-end latency.

To mitigate the memory-bandwidth bottleneck, researchers have proposed a lossless technique, speculative decoding (Chen et al., 2023). In a nutshell, speculative decoding employs a smaller and faster draft model to generate multiple candidate tokens sequentially; these candidates are then verified in parallel by the original target model. Verification requires reading the large KV-Cache only once, whereas without speculation the entire KV-Cache must be loaded for every token. As a result, speculative decoding substantially reduces memory access and improves throughput.
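The Qwen3-8B figure above (batch 128, 8192-token outputs, roughly 21 ms of KV-Cache loading per step) can be sanity-checked with a back-of-envelope estimate. The model configuration and HBM bandwidth below are assumed typical values, not numbers taken from the paper:

```python
# Rough KV-Cache bandwidth estimate for the Qwen3-8B example above.
# Assumed (not from the paper): 36 layers, 8 KV heads, head dim 128 (GQA),
# FP16 cache, and ~3.35 TB/s of H100 HBM bandwidth.
layers, kv_heads, head_dim, dtype_bytes = 36, 8, 128, 2
batch, avg_context = 128, 8192 // 2          # average context length over the run

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V
bytes_per_step = batch * avg_context * kv_bytes_per_token
hbm_bandwidth = 3.35e12                       # bytes/s

print(f"{kv_bytes_per_token / 2**10:.0f} KiB of KV-Cache per token")        # ~144 KiB
print(f"{bytes_per_step / 1e9:.1f} GB read per decode step")                # ~77 GB
print(f"{bytes_per_step / hbm_bandwidth * 1e3:.1f} ms just to stream it")   # ~23 ms
```

The ~23 ms lower bound is consistent with the reported ~21 ms average; with speculative decoding, that read is paid once per batch of drafted tokens rather than once per token.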
However, existing speculative decoding methods require additional training or modification for each target model, limiting their applicability. Specifically, some solutions require training a separate standalone draft model (Chen et al., 2023); others modify the model architecture by adding decoder layers (Li et al., 2025a). These additional steps increase the complexity of real-world deployment. For example, training a small draft model requires careful data curation for a specific workload and may not generalize (Liu et al., 2024b). Moreover, deployment also requires redesigning the inference framework to orchestrate both models efficiently (Miao et al., 2024). Ultimately, this creates significant barriers to adoption (Zhang et al., 2024).

Other work explores training-free methods, which either employ rule-based heuristics (e.g., N-Gram (Fu et al., 2024)) or leverage the model itself via self-speculation (Liu et al., 2024a; Sun et al., 2024). For example, MagicDec (Chen et al., 2024) adopts the full model with a sliding window…

Figure 1. Comparison between autoregressive generation (a), draft-model-based speculative decoding (b), and SparseSpec. SparseSpec identifies KV-Cache loading as the key bottleneck during token generation, and uses the same model weights with dynamic sparse attention as the draft…
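
Training-free self-speculation methods differ mainly in which KV-Cache entries the draft pass is allowed to read: a fixed sliding window in MagicDec versus dynamically selected critical tokens in SparseSpec's PillarAttn. The sketch below illustrates the general token-selection idea (attention sinks plus a recent window plus the most-attended remaining positions, reusing scores observed during verification); it is an assumed, simplified rule, not the paper's PillarAttn algorithm.

```python
import numpy as np

def select_draft_kv_indices(attn_scores, n_sink=4, window=64, n_top=128):
    """Choose which KV-Cache positions a sparse-attention draft step reads.

    attn_scores: per-position importance estimates, e.g. attention weights
    observed during the previous full-attention verification pass (shape [seq_len]).
    """
    seq_len = attn_scores.shape[0]
    sinks = np.arange(min(n_sink, seq_len))                    # early "attention sink" tokens
    recent = np.arange(max(0, seq_len - window), seq_len)      # local sliding window
    middle = np.arange(n_sink, max(n_sink, seq_len - window))  # everything in between
    top = middle[np.argsort(attn_scores[middle])[-n_top:]] if middle.size else middle
    return np.unique(np.concatenate([sinks, top, recent]))

# Example: the draft pass touches ~196 of 8192 cached positions instead of all of them.
scores = np.random.rand(8192)
print(select_draft_kv_indices(scores).size, "of", scores.size, "KV entries loaded")
```

Because the verification step already computes full attention, the scores it produces are essentially free, which is the intuition behind reusing verification-stage information for drafting.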

📸 Image Gallery

fig-batch-scheduler-src.png fig-cpu-small-schedule.png fig-design-attn-src.png fig-design-offloading-src.png fig-eval-ablation-src.png fig-eval-attn-src.png fig-eval-batch-src.png fig-motivation-speedup-src.png fig-motivation-utils-src.png fig_accu_sens-src.png fig_e2e_tokens_per_sec-src.png fig_eagle3_vs_sspec-src.png fig_pillarattn-src.png system_lite_v1.png teaser-src.png

Reference

This content is AI-processed based on open access ArXiv data.
