Decoding in Geometry: Alleviating Embedding-Space Crowding for Complex Reasoning
Sampling-based decoding underlies complex reasoning in large language models (LLMs), where decoding strategies critically shape model behavior. Temperature- and truncation-based methods reshape the next-token distribution through global probability reweighting or thresholding to balance the quality-diversity tradeoff. However, they operate solely on token probabilities, ignoring fine-grained relationships among tokens in the embedding space. We uncover a novel phenomenon, embedding-space crowding, where the next-token distribution concentrates its probability mass on geometrically close tokens in the embedding space. We quantify crowding at multiple granularities and find a statistically significant negative association with reasoning success in mathematical problem solving. Motivated by this finding, we propose CraEG, a plug-and-play sampling method that mitigates crowding through geometry-guided reweighting. CraEG is training-free, single-pass, and compatible with standard sampling strategies. Experiments on multiple models and benchmarks demonstrate improved generation performance, with gains in robustness and diversity metrics.
💡 Research Summary
The paper investigates a previously overlooked phenomenon in large language model (LLM) decoding, which the authors call “embedding‑space crowding.” Traditional sampling‑based decoding methods such as temperature scaling, top‑p, and top‑k modify token probabilities globally but ignore the fine‑grained geometric relationships among token embeddings. The authors show that, during complex reasoning tasks, the next‑token probability mass often concentrates on a tight cluster of geometrically similar tokens. This concentration—embedding‑space crowding—correlates negatively with reasoning success, especially on mathematical problem‑solving benchmarks.
To quantify crowding, the authors introduce three hierarchical metrics: (1) token‑level crowding, defined as the probability‑weighted sum of cosine similarities between a given token and all other tokens; (2) step‑level crowding, the expectation of token‑level crowding under the current next‑token distribution; and (3) sequence‑level crowding, the average step‑level crowding across an entire generation. In practice they approximate these sums using the top‑K tokens (K=100) to keep computation tractable.
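The three metrics can be sketched directly from these definitions. The code below is a minimal illustration, not the authors' implementation: it assumes the top-K truncation has already been applied, so `probs` is the (renormalized) probability vector over the K candidate tokens and `embs` holds their embedding vectors; the self-similarity term is excluded from each token's crowding score.

```python
import numpy as np

def token_crowding(probs: np.ndarray, embs: np.ndarray) -> np.ndarray:
    """Token-level crowding: for each candidate token i, the
    probability-weighted sum of cosine similarities to the other
    candidates, Crowd(i) = sum_{j != i} p_j * cos(e_i, e_j)."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = e @ e.T                  # (K, K) cosine-similarity matrix
    np.fill_diagonal(sim, 0.0)    # exclude self-similarity
    return sim @ probs

def step_crowding(probs: np.ndarray, embs: np.ndarray) -> float:
    """Step-level crowding: expectation of token-level crowding
    under the current next-token distribution."""
    return float(probs @ token_crowding(probs, embs))

def sequence_crowding(step_values) -> float:
    """Sequence-level crowding: average step-level crowding
    across all generation steps."""
    return float(np.mean(step_values))
```

With two near-identical embeddings sharing most of the mass and one distinct embedding, the two clustered tokens receive high crowding scores while the distinct one scores near zero, which is exactly the concentration the metrics are designed to expose.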
Empirical analysis is performed on the AIME25 mathematical reasoning benchmark using the Qwen3‑0.6B model. The authors generate 960 reasoning traces (32 samples per problem) with temperature = 1.0 and top‑p = 1.0. They find a clear monotonic decline in answer accuracy as crowding increases: low‑crowding sequences achieve 34.38 % correct, mid‑crowding 13.12 %, and high‑crowding only 1.56 %. A point‑biserial correlation of r = −0.39 (p ≈ 10⁻³⁶) confirms statistical significance. Moreover, when controlling for Shannon entropy—a standard uncertainty measure—crowding remains a strong negative predictor (odds ratio = 0.29, p = 0.001) while entropy does not (odds ratio = 0.63, p = 0.26). This demonstrates that crowding captures information orthogonal to traditional dispersion metrics.
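The point-biserial correlation used here is simply the Pearson correlation between a binary variable (answer correct or not) and a continuous one (sequence-level crowding). A toy sketch with invented data, purely to show the computation and the expected sign of the relationship:

```python
import numpy as np

# Hypothetical per-sequence data (NOT the paper's): 1 = correct answer,
# paired with that sequence's crowding score.
correct  = np.array([1, 1, 0, 1, 0, 0, 0, 1])
crowding = np.array([0.2, 0.3, 0.8, 0.1, 0.9, 0.7, 0.6, 0.4])

# Point-biserial correlation = Pearson correlation with a binary variable.
r = np.corrcoef(correct, crowding)[0, 1]
print(f"point-biserial r = {r:.3f}")  # negative: crowding tracks failure
```

On real data one would also fit the logistic regression mentioned above, with both crowding and Shannon entropy as predictors, to obtain the per-predictor odds ratios that separate the two signals.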
Motivated by these findings, the authors propose CraEG (Crowding‑Aware Sampling via Embedding Geometry), a plug‑in, training‑free decoding augmentation. At each step, CraEG (i) selects a correction set Sₜ of tokens whose probabilities exceed a small threshold ε, (ii) computes token‑level crowding within Sₜ, and (iii) re‑weights each token’s probability by a factor (1 − α·Crowd_token(i)), where α is a step‑adaptive strength parameter. The re‑weighted distribution is then fed into any standard sampling strategy (e.g., top‑p, temperature). This operation requires only a single forward pass and no auxiliary models, making it compatible with existing pipelines.
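Steps (i)-(iii) can be sketched as follows. This is a simplified illustration under stated assumptions: `eps` and `alpha` are placeholder values (the paper makes α step-adaptive, which is omitted here), and crowding is computed only within the correction set Sₜ.

```python
import numpy as np

def craeg_reweight(probs: np.ndarray, embs: np.ndarray,
                   eps: float = 1e-3, alpha: float = 0.5) -> np.ndarray:
    """One CraEG step (sketch): down-weight tokens that are crowded in
    embedding space, then renormalize before standard sampling."""
    keep = probs > eps                      # (i) correction set S_t
    p, e = probs[keep], embs[keep]
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, 0.0)
    crowd = sim @ p                         # (ii) token-level crowding in S_t
    adjusted = probs.copy()
    adjusted[keep] = p * (1.0 - alpha * crowd)  # (iii) geometry-aware reweighting
    adjusted = np.clip(adjusted, 0.0, None)     # guard against negative weights
    return adjusted / adjusted.sum()
```

The returned distribution would then be passed to the usual temperature/top-p machinery; since only the current logits and the (static) embedding matrix are needed, no extra forward pass is required.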
The method is evaluated on several LLMs (Qwen3‑1.7B, Qwen3‑4B, HunYuan‑1.8B) across three challenging mathematical reasoning benchmarks. Compared with strong baselines (temperature = 1, top‑p = 1), CraEG yields consistent improvements: average @32 score rises by 0.52 points, pass@8 by 1.98 percentage points, distinct‑n diversity by 1.17 points, and semantic diversity by 0.62 points. These gains are observed even when the baseline already uses aggressive sampling settings, indicating that geometry‑aware re‑weighting provides complementary benefits to probability‑based smoothing.
The paper’s contributions are threefold: (1) identification and formalization of embedding‑space crowding as a decoding phenomenon; (2) quantitative evidence linking crowding to reduced reasoning performance; and (3) introduction of a simple, model‑agnostic mitigation technique (CraEG) that improves both accuracy and diversity without extra training or inference cost.
Limitations include the focus on mathematical reasoning; generalization to other generation tasks (code synthesis, summarization, dialogue) remains to be validated. Additionally, CraEG relies on static token embeddings; in scenarios where embeddings evolve (e.g., during fine‑tuning) the method may need adaptation. Hyperparameters ε and α may also require tuning for different model scales or domains.
In summary, the work highlights that the geometric arrangement of token probabilities is a crucial, previously ignored factor in LLM decoding. By explicitly addressing embedding‑space crowding, CraEG offers a practical and effective way to enhance complex reasoning capabilities, opening new avenues for geometry‑aware decoding research.