HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-BENCH

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the “Long-Context Tax” and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders (“reasonable hesitation”). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). Validated on industrial-scale Mixture-of-Experts (MoE) models across varying context windows (32K/128K), our approach demonstrates superior robustness and predictive power. This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.


💡 Research Summary

The paper addresses a critical gap in the development of large language models (LLMs) for complex software engineering tasks as measured by the SWE‑Bench benchmark. While the acquisition of such capabilities occurs primarily during a "mid‑training" phase, in which domain‑specific, high‑density code data are ingested with extended context windows (32K to 128K tokens), the community lacks a reliable, low‑latency metric to monitor progress. Conventional metrics such as perplexity (PPL) and bits‑per‑character (BPC) suffer from two major shortcomings. First, when the context length is increased, positional‑embedding frequency scaling (e.g., linear RoPE or YARN) temporarily distorts the attention distribution, inflating predictive entropy and causing a sharp rise in PPL. The authors term this phenomenon the "Long‑Context Tax": although the model's true competence may be unchanged, PPL misleadingly suggests regression. Second, PPL correlates strongly with Top‑1 accuracy, but the relationship deteriorates for higher‑order Top‑k (k > 1) metrics. SWE‑Bench tasks, however, often require only that the correct token appear within a modest candidate set (e.g., Top‑10), making PPL an inadequate proxy for downstream performance.
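The divergence between PPL and Top‑k accuracy is easy to reproduce in miniature. The sketch below (illustrative, not the paper's evaluation code) computes both from full next‑token distributions: flattening a distribution, as the Long‑Context Tax does, inflates PPL even when the target token's rank, and hence Top‑k accuracy, is unchanged.

```python
import math

def ppl_and_topk(prob_rows, targets, k=10):
    """Corpus perplexity and Top-k accuracy from next-token distributions.

    prob_rows: one probability distribution per token position.
    targets:   index of the true next token at each position.
    """
    nll, hits = 0.0, 0
    for probs, t in zip(prob_rows, targets):
        nll += -math.log(probs[t])  # negative log-likelihood of the target
        topk = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
        hits += t in topk
    return math.exp(nll / len(targets)), hits / len(targets)

# A sharp distribution vs. the same ranking flattened (as context-window
# extension can do): the target stays rank 1, so Top-k accuracy is
# untouched, but PPL inflates.
sharp = [[0.9] + [0.1 / 9] * 9]
flat = [[0.3] + [0.7 / 9] * 9]
ppl_sharp, acc_sharp = ppl_and_topk(sharp, [0], k=3)
ppl_flat, acc_flat = ppl_and_topk(flat, [0], k=3)
```

Here PPL roughly triples (about 1.11 to 3.33) while Top‑3 accuracy stays at 1.0, mirroring how the Long‑Context Tax can penalize PPL without degrading rank‑based competence.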

To overcome these limitations, the authors make three core contributions.

  1. Data‑efficient evaluation protocol – They curate 500 successful SWE‑Bench trajectories, then apply a token‑level filtering pipeline that removes “Thought” and “Observation” sections, strips XML/markdown tags, eliminates comments via abstract‑syntax‑tree parsing, and discards extraneous whitespace. This yields a clean set of functional “Action” tokens (≈12.5 M tokens) that capture the essence of the model’s problem‑solving behavior while being cheap to evaluate.
  2. Entropy Compression Hypothesis – By examining the distribution of the top‑10 token entropy across checkpoints, the authors discover that high‑entropy tokens cluster around specific logarithmic values (ln 2, ln 3, ln 4). Superior models exhibit a pronounced “Shift to ln 3” for tokens that miss the top‑2 predictions, indicating that the model compresses residual uncertainty into roughly three plausible alternatives. This “reasonable hesitation” reflects a structured form of intelligence that goes beyond scalar loss minimization.
  3. HE‑SNR (High‑Entropy Signal‑to‑Noise Ratio) – Building on the hypothesis, they define a metric that isolates the average entropy of missed high‑entropy tokens (the “signal”) and divides it by the average entropy of all top‑10 tokens (the “noise”). A higher HE‑SNR implies that, even when the model is uncertain, it concentrates probability mass on a small, meaningful set of candidates rather than scattering it uniformly.
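A minimal sketch of the token‑level filtering step (item 1) is below, assuming Python action code and simple XML‑style trajectory markup; the tag names and regexes are illustrative, since the summary does not give the paper's exact schema. Python's `ast` discards comments during parsing, so an `ast.parse`/`ast.unparse` round trip emits comment‑free code.

```python
import ast
import re

def strip_comments(code: str) -> str:
    """Drop comments (and normalize formatting) via an AST round trip.

    Comments never reach the AST, so unparse() emits comment-free code.
    Docstrings survive this pass; the paper's handling of those is not
    specified in this summary.
    """
    return ast.unparse(ast.parse(code))

def clean_action_text(text: str) -> str:
    # Remove "Thought"/"Observation" sections wrapped in XML-style tags
    # (tag names here are illustrative, not the paper's exact markup).
    text = re.sub(r"<(thought|observation)>.*?</\1>", "", text,
                  flags=re.S | re.I)
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", text)  # residual tags
    text = re.sub(r"[ \t]+\n", "\n", text)         # trailing whitespace
    return re.sub(r"\n{3,}", "\n\n", text).strip() # collapse blank runs

sample = ("<thought>consider edge cases</thought>\n"
          "<action>\nx = 1  # set x\n</action>\n")
cleaned = clean_action_text(sample)              # keeps only the Action body
code = strip_comments("x = 1  # set x\nprint(x)")
```

In the full pipeline, `strip_comments` would be applied to the code blocks that survive `clean_action_text`, leaving only functional Action tokens.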
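Under one hedged reading of the summary, items 2 and 3 can be sketched as follows: compute the entropy of each token's renormalized top‑10 slice, take the mean over "missed" high‑entropy tokens (target outside the top‑2) as the signal, and divide by the mean over all tokens as the noise. The threshold `tau`, the renormalization step, and the exact notion of "missed" are assumptions, not values from the paper.

```python
import math

def top10_entropy(probs, k=10):
    """Shannon entropy of the renormalized top-k slice of a distribution."""
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)
    return -sum((p / z) * math.log(p / z) for p in top if p > 0)

def he_snr(prob_rows, targets, k=10, tau=0.5):
    """One hedged reading of HE-SNR.

    signal: mean top-k entropy over 'missed' high-entropy tokens, i.e.
            tokens whose entropy exceeds tau and whose target falls
            outside the model's top-2 predictions.
    noise:  mean top-k entropy over all tokens.
    """
    ents, signal = [], []
    for probs, t in zip(prob_rows, targets):
        h = top10_entropy(probs, k)
        ents.append(h)
        top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
        if h > tau and t not in top2:
            signal.append(h)
    noise = sum(ents) / len(ents)
    return (sum(signal) / len(signal)) / noise if signal else 0.0

# An Entropy-Compressed State: mass split evenly over 3 candidates gives
# top-10 entropy of exactly ln 3, the cluster the paper highlights.
h3 = top10_entropy([1 / 3, 1 / 3, 1 / 3] + [0.0] * 7)  # ln 3 ≈ 1.0986
```

A model whose missed high‑entropy tokens sit near ln 2 or ln 3 (hesitation over a few plausible alternatives) scores higher under this ratio than one that scatters residual mass across all ten candidates.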

The experimental suite uses two proprietary Mixture‑of‑Experts (MoE) models: MoE‑S (32K context, linear RoPE scaling) and MoE‑L (128K context, YARN scaling), spanning a 10× scale range. Across multiple checkpoints, the authors report Pearson correlations above 0.92 between HE‑SNR and SWE‑Bench Pass@1, dramatically outperforming PPL, whose correlation deteriorates most during the Long‑Context Tax phase. Notably, in MoE‑L the Long‑Context Tax causes PPL to spike and Top‑1 accuracy to drop sharply, yet Top‑10 accuracy remains stable and HE‑SNR stays flat, while SWE‑Bench performance actually improves, demonstrating that the tax merely redistributes probability mass without harming competence.
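The reported correlations are ordinary Pearson r between a metric series over checkpoints and Pass@1. A self‑contained sketch of that computation follows; the checkpoint values below are illustrative only, not the paper's results.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative checkpoint series (NOT the paper's numbers): a metric that
# tracks Pass@1 monotonically across checkpoints yields a high r.
he_snr_series = [1.10, 1.18, 1.25, 1.31, 1.40]
pass_at_1 = [0.12, 0.16, 0.19, 0.24, 0.28]
r = pearson_r(he_snr_series, pass_at_1)
```

A metric whose r stays above 0.92 across both context regimes, as claimed for HE‑SNR, is a far safer early‑stopping or checkpoint‑selection signal than one whose correlation collapses during context extension.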

Furthermore, after supervised fine‑tuning (SFT), overall PPL improves but HE‑SNR declines because high‑entropy tokens experience entropy inflation. The authors label this the “Alignment Tax,” arguing that SFT optimizes for surface pattern matching at the expense of deep reasoning, thereby reducing the model’s ability to maintain structured uncertainty.

In summary, the paper proposes a theoretically grounded, empirically validated metric—HE‑SNR—that reliably predicts LLM performance on complex software engineering tasks during mid‑training, is robust to context‑length induced artifacts, and reveals subtle trade‑offs introduced by fine‑tuning. The work opens avenues for entropy‑driven curriculum design, automated early‑stopping criteria, and cross‑domain extensions of the metric to other reasoning‑heavy benchmarks.

