Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to a 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance but also boosts creativity under judge-based open-ended evaluation.


💡 Research Summary

Large language models (LLMs) excel at many tasks, yet their generation quality hinges heavily on the decoding algorithm used at inference time. Traditional truncation‑based samplers such as top‑k, top‑p, top‑H, locally typical, or Min‑p rely only on token probabilities and entropy, completely ignoring the semantic geometry encoded in token embeddings. The authors propose Top‑W, a geometry‑aware truncation rule that incorporates a Wasserstein‑1 (Earth Mover’s) distance defined over an embedding‑induced ground metric, together with explicit mass‑entropy trade‑offs.
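For contrast with the geometry-aware rule, the following is a minimal sketch of two of the baseline samplers named above, top-k and top-p. It is illustrative only and is not taken from the paper: the function names `top_k_set` and `top_p_set` and the toy distribution are assumptions. Note that both rules look only at the probability values and never at token embeddings.

```python
def top_k_set(probs, k):
    """Return the indices of the k most probable tokens."""
    # Sort token indices by probability, highest first (ties keep index order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def top_p_set(probs, p_threshold):
    """Return the smallest probability-sorted prefix whose mass reaches p_threshold."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p_threshold:
            break
    return kept


# Hypothetical toy next-token distribution over a 4-token vocabulary.
toy_probs = [0.5, 0.25, 0.125, 0.125]
top_k_result = top_k_set(toy_probs, 2)    # keeps the two most probable tokens
top_p_result = top_p_set(toy_probs, 0.7)  # keeps tokens until mass >= 0.7
```

Either kept set would then be renormalized and sampled from, which is the truncation-and-sampling interface that Top-W is described as leaving unchanged.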

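Core formulation: Given the model’s next‑token distribution \(p\) over vocabulary \(V\) and a candidate subset \(S\subseteq V\), the retained probability mass is \(\Gamma_S=\sum_{i\in S}p_i\) and the renormalized distribution is \(q_S(i)=p_i/\Gamma_S\) for \(i\in S\). The objective to be minimized combines the Wasserstein‑1 distance between \(q_S\) and \(p\) with terms that trade off the retained mass \(\Gamma_S\) against the entropy of the kept set.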
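The two quantities defined in the core formulation, the retained mass \(\Gamma_S\) and the renormalized distribution \(q_S\), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the toy distribution, and the use of Shannon entropy (natural log) of \(q_S\) as "the entropy of the kept set" are all assumptions made here for illustration.

```python
import math


def retained_mass(p, S):
    """Gamma_S: total probability of the tokens kept in S."""
    return sum(p[i] for i in S)


def renormalize(p, S):
    """q_S: the distribution p restricted to S and rescaled to sum to 1."""
    gamma = retained_mass(p, S)
    if gamma <= 0.0:
        raise ValueError("kept set S must carry positive probability mass")
    return {i: p[i] / gamma for i in S}


def shannon_entropy(q):
    """Shannon entropy in nats (an assumed choice of entropy for this sketch)."""
    return -sum(v * math.log(v) for v in q.values() if v > 0.0)


# Hypothetical toy distribution over a 4-token vocabulary and a candidate crop.
p = [0.5, 0.25, 0.125, 0.125]
S = {0, 1}

gamma_S = retained_mass(p, S)  # 0.5 + 0.25
q_S = renormalize(p, S)        # token 0 gets 2/3, token 1 gets 1/3
h_S = shannon_entropy(q_S)
```

A larger \(S\) raises \(\Gamma_S\) but typically also raises the entropy of \(q_S\), which is the tension the mass-entropy trade-off is meant to resolve.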

