Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly
Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre-specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first-step denoising confidence: a local \textit{Oracle Peak} that emerges near the ground-truth length and a systematic \textit{Length Bias} that often obscures this signal. By leveraging this signal and calibrating the bias, our training-free method \textbf{CAL} (\textbf{C}alibrated \textbf{A}daptive \textbf{L}ength) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7% over fixed-length baselines and 40.5% over chat-based adaptive methods in code infilling, while boosting BLEU-2 and ROUGE-L by up to 8.5% and 9.9% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at https://github.com/NiuHechang/Calibrated_Adaptive_Length.
💡 Research Summary
Diffusion Language Models (DLMs) generate sequences by iteratively denoising a fully corrupted input, which naturally provides bidirectional context and makes them theoretically well‑suited for infilling tasks where a middle segment must bridge a given prefix and suffix. In practice, however, DLMs require the length of the masked segment to be fixed in advance. If this length is underestimated, the generated completion is truncated; if overestimated, the model fills the extra slots with redundant or erroneous tokens. This sensitivity to a pre‑specified mask length severely limits DLM performance on code and text infilling benchmarks.
The authors investigate whether DLMs already encode information about the optimal infilling length within their own inference dynamics. They focus on the first denoising step, i.e., the model's prediction p_θ(x_0 | x_T) when the input is completely masked except for the surrounding context. For a candidate length L they define the average first‑step confidence
Φ(L) = (1/L) ∑_{j∈I_mask} max_{v∈V} p_j(v),
where I_mask indexes the masked positions. Measuring Φ(L) across many tasks reveals two systematic phenomena. First, an Oracle Peak appears: Φ(L) attains a local maximum when L is close to the ground‑truth length L*. This indicates that the model is most certain when the allocated mask matches the true amount of information needed to complete the sequence. Second, a Length Bias is observed: Φ(L) monotonically declines as L grows, then stabilises, because longer masked spans weaken contextual constraints and increase uncertainty. The bias can obscure the Oracle Peak, making raw confidence unsuitable for direct length selection.
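The per‑length confidence above can be sketched in a few lines of NumPy, assuming the model's first‑step denoising distribution over the masked span is available as an (L × V) probability matrix; the actual model call is mocked here with random probabilities:

```python
import numpy as np

def first_step_confidence(probs: np.ndarray) -> float:
    """Average first-step confidence Phi(L).

    probs: array of shape (L, V) holding the first-step denoising
    distribution p_j(v) at each of the L masked positions (a
    stand-in for the model's p_theta(x_0 | x_T) prediction).
    """
    # max_v p_j(v) at each masked position, averaged over the L slots
    return float(probs.max(axis=1).mean())

# toy example: 3 masked positions over a 4-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(first_step_confidence(probs))
```

In the real procedure, `probs` would come from one forward pass of the DLM with L mask tokens inserted between the prefix and suffix.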
To remove the bias, the authors fit a double‑exponential function B(L) = a e^{‑bL} + c e^{‑dL} + e using confidence values from non‑oracle lengths (excluding a window around L*). The calibrated confidence is then
Φ_c(L) = Φ(L) / B(L).
After division, the length‑dependent decay disappears and the Oracle Peak becomes a clear global optimum.
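Since the summary does not spell out the fitting procedure, the following NumPy sketch illustrates the calibration step under simple assumptions: the double‑exponential bias is fitted on non‑oracle lengths via a coarse grid search over the decay rates (b, d) with linear least squares for (a, c, e), standing in for whatever nonlinear fitter the authors used. On synthetic confidences built from a decaying bias plus a small bump at L* = 12, the raw argmax sits at the shortest length, while the calibrated argmax recovers L*:

```python
import numpy as np

def fit_length_bias(lengths, phi, grid=np.linspace(0.01, 1.0, 50)):
    """Fit B(L) = a*exp(-b*L) + c*exp(-d*L) + e.

    Coarse grid search over decay rates (b, d); the linear
    coefficients (a, c, e) are solved by least squares.
    """
    L = np.asarray(lengths, float)
    y = np.asarray(phi, float)
    best_resid, best_params = np.inf, None
    for b in grid:
        for d in grid:
            X = np.column_stack([np.exp(-b * L), np.exp(-d * L), np.ones_like(L)])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = ((X @ coef - y) ** 2).sum()
            if resid < best_resid:
                best_resid = resid
                best_params = (coef[0], b, coef[1], d, coef[2])
    a, b, c, d, e = best_params
    return lambda q: a * np.exp(-b * np.asarray(q, float)) \
                   + c * np.exp(-d * np.asarray(q, float)) + e

# synthetic confidences: decaying bias plus an oracle bump at L* = 12
Ls = np.arange(1, 31)
bias = 0.5 * np.exp(-0.15 * Ls) + 0.3
phi = bias + 0.05 * np.exp(-0.5 * (Ls - 12) ** 2)

# fit on non-oracle lengths (a window around L* is excluded)
keep = np.abs(Ls - 12) > 3
B = fit_length_bias(Ls[keep], phi[keep])
phi_c = phi / B(Ls)

print(Ls[np.argmax(phi)], Ls[np.argmax(phi_c)])  # raw peak vs calibrated peak
```

After division the monotone decay flattens to roughly 1 everywhere except near L*, so the Oracle Peak becomes the global maximum, as described above.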
Building on this insight, they propose CAL (Calibrated Adaptive Length), a training‑free length‑discovery procedure. CAL consists of a probing stage followed by standard denoising. In the probing stage, a set of candidate lengths R is examined by computing Φ_c(L) for each length with a single forward pass. Rather than exhaustive search, CAL employs a bidirectional hill‑climbing algorithm: starting from an initial heuristic length L_init, it evaluates Φ_c(L) at neighbouring candidates in both directions, moves to the neighbour with higher confidence, and repeats until no improvement is observed for D consecutive steps. Because each probe requires only one forward pass, the additional inference overhead is modest. The discovered length \hat L is then used to construct the masked input, and full diffusion decoding proceeds as usual.
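The probing stage can be sketched as follows. The exact expansion and stopping rules are assumptions on my part, since only bidirectional hill climbing with a patience of D steps is described; `phi_c` is any callable mapping a length to its calibrated confidence (one forward pass each in the real model):

```python
def cal_search(phi_c, l_init, l_min=1, l_max=256, patience=2):
    """Bidirectional hill climbing over candidate lengths.

    Expands a left and a right frontier outward from l_init,
    keeping the best calibrated confidence seen so far, and
    stops after `patience` consecutive steps with no improvement.
    """
    best_l, best_v = l_init, phi_c(l_init)
    left, right = l_init - 1, l_init + 1
    stale = 0
    while stale < patience and (left >= l_min or right <= l_max):
        improved = False
        for cand in (left, right):
            if l_min <= cand <= l_max:
                v = phi_c(cand)
                if v > best_v:
                    best_l, best_v = cand, v
                    improved = True
        left, right = left - 1, right + 1
        stale = 0 if improved else stale + 1
    return best_l
```

With a unimodal calibrated-confidence curve, the search walks toward the peak and terminates a few probes after passing it, so the number of forward passes stays close to the distance between L_init and the discovered length.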
Experiments are conducted on several state‑of‑the‑art DLMs (LLaDA‑8B‑Base, DiffuCoder‑Base, DreamCoder, etc.) on both code‑infilling (HumanEval‑Infilling) and text‑infilling benchmarks, the latter scored with BLEU‑2 and ROUGE‑L. CAL consistently outperforms fixed‑length baselines, achieving up to a 47.7% relative gain in Pass@1 for code infilling and up to 40.5% over existing chat‑based adaptive‑length methods. For text infilling, BLEU‑2 improves by up to 8.5% and ROUGE‑L by up to 9.9%. The bias‑fitting function B(L), learned once, generalises across models and tasks, demonstrating the robustness of the approach.
In summary, the paper reveals that DLMs implicitly encode optimal infilling length in their first‑step confidence, identifies and corrects a systematic length bias, and introduces CAL—a lightweight, training‑free length‑adaptation mechanism. CAL decouples DLM infilling performance from rigid mask‑length specifications, substantially advancing the practical utility of diffusion‑based language generation for code completion, text reconstruction, and controllable generation scenarios.