Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Multi-token generation has emerged as a promising paradigm for accelerating inference in transformer-based large language models. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch: the masked data distribution seen in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Building on the trajectory characteristics of Jacobi Forcing Models, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
💡 Research Summary
The paper tackles the long‑standing latency bottleneck of autoregressive (AR) large language models (LLMs) by turning them into efficient parallel decoders without sacrificing the causal priors learned during pre‑training. Recent diffusion‑based LLMs (dLLMs) enable parallel generation but suffer from a severe pre‑train‑to‑post‑train mismatch: they replace the causal mask with bidirectional attention, and they train on masked data distributions that differ sharply from the natural data seen during pre‑training. Consequently, dLLMs achieve limited speed‑ups and cannot fully exploit modern accelerator FLOPs.
The authors propose Jacobi Forcing, a progressive distillation framework that trains AR models on their own parallel decoding trajectories, preserving the original causal attention. The core idea builds on Jacobi decoding, a fixed‑point iteration that updates an entire block of tokens in parallel using a causal mask. Starting from a random token block, each iteration replaces every token with the arg‑max prediction conditioned on the current left‑context. The process converges to the same output as greedy AR decoding, guaranteeing correctness.
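The fixed-point iteration can be illustrated with a minimal, self-contained sketch. Here a deterministic toy function (`toy_next_token`) stands in for a real AR model's greedy argmax; all names and the toy "model" are assumptions for illustration, not the paper's implementation:

```python
import random

VOCAB = 10

def toy_next_token(context):
    # Hypothetical deterministic "model": the greedy next token is a
    # function of the context (stand-in for an AR LLM's argmax head).
    return (sum(context) + 1) % VOCAB

def greedy_ar(prompt, n):
    # Standard greedy AR decoding: one token per step.
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, seed=0):
    rng = random.Random(seed)
    block = [rng.randrange(VOCAB) for _ in range(n)]  # random initial block
    iters = 0
    while True:
        iters += 1
        # One Jacobi step: update every position in parallel, each
        # conditioned on the *current* (possibly noisy) left context.
        new_block = [toy_next_token(list(prompt) + block[:i]) for i in range(n)]
        if new_block == block:  # fixed point reached
            return block, iters
        block = new_block

block, iters = jacobi_decode((3, 1), 6)
assert block == greedy_ar((3, 1), 6)  # converges to the greedy AR output
```

Because position 0 depends only on the prompt, it is correct after one iteration and stays correct; by induction the iteration converges in at most `n` steps (plus one step to detect the fixed point), matching greedy AR decoding exactly.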
Jacobi Forcing introduces two key innovations:
- Progressive Noise Schedule – Large blocks are split into smaller sub‑blocks, each assigned a noise ratio that linearly increases from 0 to 1 within a cyclic window. This ensures that every prediction always has a partially clean context, dramatically reducing the longest noisy dependency chain and stabilizing training for large block sizes.
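The schedule can be sketched as follows. The function name and the exact linear, cyclic parameterization are assumptions; the summary only states that ratios rise linearly from 0 to 1 within a cyclic window:

```python
def progressive_noise_schedule(num_subblocks, window):
    """Assign each sub-block a noise ratio in [0, 1].

    Hypothetical sketch: ratios rise linearly from 0 to 1 across each
    cyclic window of `window` sub-blocks, then reset, so every window
    begins with a clean sub-block that anchors later predictions.
    """
    if window < 2:
        return [0.0] * num_subblocks
    return [(i % window) / (window - 1) for i in range(num_subblocks)]

ratios = progressive_noise_schedule(8, 4)
print(ratios)  # ratios cycle 0 -> 1 every 4 sub-blocks
```

With this layout, no prediction ever conditions on a fully noisy window, which is the property the paper credits with shortening the longest noisy dependency chain.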
- Progressive Consistency Loss – For each sub‑block, the model predicts the "clean" target tokens while conditioned on a mixture of clean and noisy tokens taken from the Jacobi trajectory at the appropriate noise level. The loss is a KL divergence between the teacher (the original AR model, with stop‑gradient) and the student, summed across all sub‑blocks. By packing clean and noisy sequences together and using a block‑wise sparse attention mask, the authors reduce the number of forward passes from O(N) to O(1), making training efficient.
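A minimal sketch of the loss computation, assuming logits are already gathered per sub-block; the function names, logit layout, and the pure-Python softmax/KL helpers are illustrative stand-ins, not the authors' packed-attention implementation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q); the teacher distribution p is treated as fixed
    # (stop-gradient), matching the paper's teacher/student setup.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def progressive_consistency_loss(teacher_logits, student_logits):
    # Shapes (assumed): [num_subblocks][positions][vocab].
    # Teacher: original AR model conditioned on clean context.
    # Student: current model conditioned on the mixed clean/noisy
    # Jacobi-trajectory context at the sub-block's noise level.
    loss = 0.0
    for t_sb, s_sb in zip(teacher_logits, student_logits):
        for t_pos, s_pos in zip(t_sb, s_sb):
            loss += kl(softmax(t_pos), softmax(s_pos))
    return loss
```

In the actual method, the O(1)-forward-pass trick comes from packing all clean and noisy sub-sequences into one batch with a block-wise sparse attention mask, so these per-sub-block logits come from a single forward pass rather than N separate ones.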
The final training objective combines this progressive consistency loss with a standard AR loss (weighted by a tunable λ) to preserve generation quality.
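In symbols (notation assumed, since the summary does not write the equation out), the combined objective is:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{consistency}} \;+\; \lambda\, \mathcal{L}_{\mathrm{AR}}
```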
To break the speed‑up ceiling observed after many training steps, the authors perform iterative re‑distillation: after training with a certain block size, they generate new Jacobi trajectories using the partially trained model with a larger block size, then continue distillation. This yields an additional ~20 % speed gain with negligible performance loss.
On the inference side, two complementary optimizations are introduced:
- Rejection‑Recycling – During Jacobi decoding, high‑quality n‑grams that appear in the noisy (unconverged) states are stored in a pool. When the last accepted token of a new iteration matches the first token of a stored n‑gram, the whole n‑gram is proposed as a candidate and verified in parallel across the batch dimension, allowing many tokens to be fast‑forwarded in a single iteration.
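The pool mechanics can be sketched as below; the class and method names are assumptions, and verification against the model (the parallel check across the batch dimension) is elided:

```python
from collections import defaultdict

class NGramPool:
    # Hypothetical sketch of rejection recycling: n-grams harvested
    # from unconverged Jacobi states are pooled, keyed by first token.
    def __init__(self, n=3):
        self.n = n
        self.pool = defaultdict(set)

    def harvest(self, tokens):
        # Store every n-gram observed in a (noisy) decoding state.
        for i in range(len(tokens) - self.n + 1):
            gram = tuple(tokens[i:i + self.n])
            self.pool[gram[0]].add(gram)

    def propose(self, last_accepted):
        # Candidate continuations whose first token matches the last
        # accepted token; each would then be verified in parallel.
        return [g[1:] for g in self.pool.get(last_accepted, ())]

pool = NGramPool(n=3)
pool.harvest([7, 2, 9, 2, 9, 4])
print(sorted(pool.propose(2)))  # continuations seen after token 2
```

Any proposed continuation that the model verifies is accepted wholesale, fast-forwarding several positions in one iteration at the cost of extra (batched) verification compute.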
- Multi‑Block Decoding – Multiple blocks are kept alive simultaneously; even if earlier blocks have not yet converged, later blocks can already produce correct tokens based on the partially cleaned context. This further raises the number of tokens accepted per iteration.
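One iteration over several live blocks can be sketched as follows, with `step_fn` standing in for the model's greedy next-token prediction (the function names and the toy model below are assumptions for illustration):

```python
def multi_block_step(blocks, prompt, step_fn):
    # Hypothetical sketch: one parallel Jacobi iteration over several
    # live blocks. Each position conditions on the prompt plus the
    # *current* (possibly unconverged) tokens of all earlier blocks.
    new_blocks = []
    prefix = list(prompt)
    for block in blocks:
        new_blocks.append([step_fn(prefix + block[:i]) for i in range(len(block))])
        prefix += block  # later blocks see this block's current state
    return new_blocks

# Toy deterministic stand-in for the model's argmax.
step = lambda ctx: (sum(ctx) + 1) % 10
blocks = [[0, 0, 0], [0, 0, 0]]
for _ in range(10):
    blocks = multi_block_step(blocks, (3, 1), step)
print(blocks)  # both blocks converge to the greedy AR continuation
```

Even before the first block converges, the second block iterates on its partially cleaned context, so correct tokens can surface in later blocks early; this trades extra parallel compute for more accepted tokens per iteration.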
Experiments on coding benchmarks (HumanEval, MBPP) and a mathematics benchmark (MATH) demonstrate that the Jacobi Forcing Model achieves 3.8×–4.5× higher token acceptance per iteration and 3.5×–4.0× wall‑clock speed‑up compared with a strong AR baseline, while incurring less than 0.3 % drop in standard quality metrics (e.g., Pass@1). Ablation studies confirm the importance of the progressive noise schedule, the packed‑sequence loss computation, and the inference‑time recycling strategies.
In summary, the paper presents a practical, theoretically grounded method to convert high‑quality AR LLMs into fast parallel decoders. By training on self‑generated trajectories, preserving causal masks, and leveraging clever inference tricks, Jacobi Forcing bridges the gap between AR quality and dLLM‑style speed, and the released codebase enables immediate adoption in research and production settings.