Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States
Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific meanings. Experiments on four benchmarks show that SemanticSpec achieves up to 2.7× speedup on DeepSeekR1-32B and 2.1× on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.
💡 Research Summary
Large Language Models (LLMs) have achieved remarkable performance across a wide range of tasks, yet their deployment is hampered by high inference latency caused by the inherently sequential nature of autoregressive decoding. This problem is especially acute for Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek R1, which generate long chains of thought (CoT) before producing a final answer. Speculative decoding (SD) has emerged as a promising technique to alleviate this bottleneck: a fast draft model proposes multiple future tokens, and the target LLM verifies them in a single parallel forward pass, accepting tokens that meet a verification criterion. Existing SD approaches operate at the token level and ignore semantic equivalence—different token sequences that convey the same meaning. Consequently, many semantically correct drafts are unnecessarily rejected, limiting the acceptance rate and overall speedup. Recent sequence‑level speculative methods for LRMs (e.g., SpecReason, Speculative Thinking) attempt to speculate reasoning steps instead of individual tokens, but they rely on LLM‑as‑a‑judge evaluations, which are known to be biased and unreliable.
The paper introduces SemanticSpec, a novel semantic‑aware speculative decoding framework that elevates the granularity of speculation from tokens to entire semantic sequences. The core idea is to estimate a semantic probability: the likelihood that a model will generate any token sequence expressing a particular meaning, rather than the probability of a specific token string. Directly computing this probability would require exhaustive sampling, which defeats the purpose of SD. The authors observe that hidden states inside an LLM correlate strongly with semantic probability; hidden representations cluster according to the underlying meaning’s likelihood. Leveraging this insight, they design an offline‑trained semantic probability predictor that maps aggregated hidden states (average‑pooled across all layers) to a scalar probability estimate.
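The predictor described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the shapes, layer sizes, and function names (`aggregate_hidden_states`, `SemanticProbPredictor`) are hypothetical, and the offline training procedure is omitted. The key ideas it mirrors are (a) average-pooling hidden states across all layers and token positions into one feature vector, and (b) mapping that vector to a scalar probability with a small learned head.

```python
import numpy as np

def aggregate_hidden_states(hidden_states: np.ndarray) -> np.ndarray:
    """Average-pool hidden states across layers and token positions.

    hidden_states: array of shape (num_layers, num_tokens, d_model)
    returns: a single d_model-dimensional feature vector for the sequence.
    """
    return hidden_states.mean(axis=(0, 1))

class SemanticProbPredictor:
    """Tiny MLP mapping a pooled hidden-state vector to a scalar in (0, 1).

    A sketch of the offline-trained semantic probability predictor; weights
    here are random stand-ins for illustration only.
    """
    def __init__(self, d_model: int, hidden: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(d_model, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.02, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, feat: np.ndarray) -> float:
        h = np.maximum(feat @ self.W1 + self.b1, 0.0)   # ReLU hidden layer
        logit = float(h @ self.W2 + self.b2)
        return 1.0 / (1.0 + np.exp(-logit))             # sigmoid -> probability
```

In practice the head would be trained offline on (pooled hidden state, empirical semantic probability) pairs; the paper notes that this predictor's quality directly bounds the achievable speedup.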
During inference, the draft model $M_q$ generates up to $\gamma$ candidate sequences $\tilde{s}_{n+1},\dots,\tilde{s}_{n+\gamma}$ together with their hidden states. The target model $M_p$ processes the same candidates in parallel, producing its own hidden states. Both sets of hidden states are fed to their respective predictors, yielding draft-side probabilities $q_i$ and target-side probabilities $p_i$. A candidate sequence is accepted with probability $\min(1, p_i \cdot q_i)$; otherwise the target model falls back to generating tokens autoregressively from the last accepted context. This acceptance rule effectively requires agreement between the draft and target on the semantic confidence of a sequence, allowing semantically equivalent but lexically different drafts to be kept.
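The verification loop above can be expressed compactly. This is a hedged sketch of the acceptance rule as the summary states it (accept sequence $i$ with probability $\min(1, p_i \cdot q_i)$, stopping at the first rejection); the function name and interface are illustrative, not from the paper.

```python
import random

def verify_semantic_drafts(p_probs, q_probs, rng=random.Random(0)):
    """Sequentially verify draft sequences against target-side estimates.

    p_probs: target-side semantic probabilities, one per candidate sequence.
    q_probs: draft-side semantic probabilities, one per candidate sequence.
    Returns the number of accepted sequences; on the first rejection the
    caller would fall back to autoregressive decoding from the last
    accepted context.
    """
    accepted = 0
    for p_i, q_i in zip(p_probs, q_probs):
        if rng.random() < min(1.0, p_i * q_i):
            accepted += 1
        else:
            break
    return accepted
```

Note that, unlike token-level speculative sampling's $\min(1, p/q)$ ratio test, this rule multiplies the two confidences, so a sequence is kept only when both models assign its meaning a high probability.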
The authors evaluate SemanticSpec on two 32‑billion‑parameter models—DeepSeekR1‑32B and QwQ‑32B—across four benchmarks, including MATH‑500 and GPQA‑D. Baselines comprise token‑level speculative decoding (Leviathan, Speculative Sampling) and recent sequence‑level methods (SpecReason, Speculative Thinking). Results show average speedups of 2.7× for DeepSeekR1‑32B and 2.1× for QwQ‑32B, with token‑per‑second throughput improvements of 1.67× and 2.66×, respectively. Importantly, accuracy remains on par with or slightly better than baselines, demonstrating that the semantic‑aware verification does not sacrifice output quality. An analysis of acceptance rates reveals a 15–20% increase over token‑level SD, confirming that many previously rejected drafts are now accepted thanks to semantic equivalence handling.
Ablation studies highlight that (1) using hidden states from all layers outperforms using only the final layer, (2) average pooling provides a simple yet effective aggregation method, and (3) the predictor’s offline training is crucial—its performance directly impacts overall speedup. Limitations include dependence on the predictor’s generalization to unseen domains and the additional memory overhead of storing multiple hidden state tensors for each candidate sequence.
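The first two ablation choices are easy to make concrete. The sketch below contrasts the two aggregation variants compared in the ablation: pooling over all layers versus the final layer only (the function names are illustrative; the paper reports the all-layer variant works better).

```python
import numpy as np

def pool_all_layers(hidden_states: np.ndarray) -> np.ndarray:
    """All-layer variant: average over every layer and token position.
    hidden_states has shape (num_layers, num_tokens, d_model)."""
    return hidden_states.mean(axis=(0, 1))

def pool_final_layer(hidden_states: np.ndarray) -> np.ndarray:
    """Final-layer-only ablation: average only the last layer's token states."""
    return hidden_states[-1].mean(axis=0)
```

Both variants yield one `d_model`-sized vector per sequence, so the same downstream predictor can be used; the difference is purely in which layers contribute signal.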
In conclusion, SemanticSpec demonstrates that probing internal model states to estimate semantic probabilities enables a more flexible and efficient speculative decoding paradigm. By shifting the verification focus from surface token matches to underlying meaning alignment, the framework achieves substantial latency reductions for large reasoning models without compromising answer correctness. Future work is outlined to explore multimodal hidden representations, dynamic adjustment of the draft count (\gamma), and more sophisticated clustering techniques for semantic probability estimation, aiming to broaden applicability across languages and modalities.