EDIS: Diagnosing LLM Reasoning via Entropy Dynamics

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Entropy-based confidence signals are increasingly leveraged to improve reasoning in large language models (LLMs), yet existing approaches treat confidence as a static quantity, typically aggregated over tokens. We show that the *temporal evolution* of confidence during generation carries richer information than aggregate statistics alone. Analyzing token-level entropy trajectories, we identify characteristic patterns distinguishing correct from incorrect reasoning: erroneous solutions exhibit unstable dynamics, including burst spikes (sustained uncertainty growth) and peak-valley spikes (sharp rebounds following transient confidence). These patterns persist across models and training stages, suggesting they reflect intrinsic properties of reasoning failure rather than superficial noise. To formalize this observation, we introduce the Entropy Dynamics Instability Score (**EDIS**), a trajectory-level metric quantifying instability in entropy evolution. EDIS serves as an effective diagnostic signal for inference-time selection, substantially improving reasoning accuracy, and offers a promising direction for training-time sample curation. Our findings establish entropy dynamics as an underexplored yet informative lens for understanding and improving LLM reasoning.


💡 Research Summary

The paper tackles a fundamental problem in large language models (LLMs): how to tell whether a generated reasoning trace is correct without external verification. Existing confidence‑based methods collapse the model’s uncertainty into a single scalar—typically the mean token‑level entropy or the entropy of the final token. The authors argue that this static view discards crucial temporal information that is inherent to autoregressive generation.

To expose the hidden signal, they introduce the notion of an entropy trajectory: the ordered sequence of token‑level entropies H_t produced as the model generates each token. By visualizing and statistically analyzing these trajectories on a large set of mathematical reasoning problems, they discover two characteristic instability patterns that are far more prevalent in incorrect solutions:

  1. Burst spikes – a sustained rise in entropy over a sliding window of w tokens, indicating that the model’s confidence deteriorates progressively as it continues to generate.
  2. Peak‑valley (rebound) spikes – a V‑shaped pattern where entropy first drops to a local minimum (a fleeting moment of high confidence) and then sharply rebounds, reflecting false confidence followed by renewed uncertainty.
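A minimal detector for these two patterns might look as follows. The window size and thresholds here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def count_spikes(H, w=5, rise_thresh=0.5, rebound_thresh=0.5):
    """Count burst and peak-valley spikes in an entropy trajectory H.

    Window size and thresholds are illustrative choices, not the
    paper's calibration.
    """
    H = np.asarray(H, dtype=float)
    bursts = 0
    rebounds = 0
    # Burst spike: entropy rises monotonically across a w-token window
    # by more than rise_thresh nats overall.
    for t in range(len(H) - w):
        window = H[t:t + w + 1]
        if np.all(np.diff(window) >= 0) and window[-1] - window[0] > rise_thresh:
            bursts += 1
    # Peak-valley spike: a local entropy minimum followed by a sharp rebound.
    for t in range(1, len(H) - 1):
        if H[t] < H[t - 1] and H[t] < H[t + 1] and H[t + 1] - H[t] > rebound_thresh:
            rebounds += 1
    return bursts, rebounds
```

On a monotonically rising trajectory this reports one burst and no rebounds; a trajectory with a sharp dip and recovery registers a peak‑valley spike instead.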

These patterns are robust across three different LLM families (Qwen2.5‑Math‑1.5B, Qwen3‑4B‑Instruct, Qwen2.5‑Math‑7B), three temperature settings (0.2, 0.6, 1.0), and multiple training checkpoints. Incorrect answers exhibit 1.7–3.6× more entropy fluctuations than correct ones (Cohen’s d ≈ 1.0).
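The trajectories underlying these comparisons are just the Shannon entropies of the model's per-step output distributions; a minimal sketch of computing them from raw logits:

```python
import numpy as np

def entropy_trajectory(logits):
    """Token-level entropy H_t (in nats) from per-step logits.

    `logits` has shape (T, V): one row of vocabulary logits per
    generated token. This is the standard Shannon entropy of the
    softmax distribution, not a paper-specific definition.
    """
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
```

A uniform distribution over V tokens yields the maximum entropy ln V, while a sharply peaked distribution yields an entropy near zero.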

From these observations the authors devise the Entropy Dynamics Instability Score (EDIS), a trajectory‑level metric that combines the frequency of both spike types with the overall variance of the entropy sequence:

Schematically, for an entropy trajectory H_1, …, H_T with N_burst burst spikes and N_rebound peak‑valley spikes, the score takes the form

$$\mathrm{EDIS} = \alpha \cdot \frac{N_{\text{burst}}}{T} + \beta \cdot \frac{N_{\text{rebound}}}{T} + \gamma \cdot \mathrm{Var}(H_{1:T}),$$

where α, β, γ weight the three components (the paper's exact definition and weighting may differ from this schematic form).
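The combination described above can be sketched in code. The weights, thresholds, and exact spike definitions here are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

def edis(H, w=5, rise_thresh=0.5, rebound_thresh=0.5, weights=(1.0, 1.0, 1.0)):
    """Schematic EDIS: a weighted sum of burst-spike frequency,
    peak-valley-spike frequency, and entropy variance.

    All hyperparameters are illustrative assumptions.
    """
    H = np.asarray(H, dtype=float)
    T = len(H)
    # Burst spikes: sustained entropy rise across a w-token window.
    bursts = sum(
        1 for t in range(T - w)
        if np.all(np.diff(H[t:t + w + 1]) >= 0)
        and H[t + w] - H[t] > rise_thresh
    )
    # Peak-valley spikes: local minimum followed by a sharp rebound.
    rebounds = sum(
        1 for t in range(1, T - 1)
        if H[t] < H[t - 1] and H[t] < H[t + 1]
        and H[t + 1] - H[t] > rebound_thresh
    )
    a, b, c = weights
    return a * bursts / T + b * rebounds / T + c * H.var()
```

For inference-time selection, one would sample several candidate solutions and keep the one with the lowest EDIS: a perfectly flat trajectory scores 0, while dips and rebounds drive the score up.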

