Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals
Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is highly unstable.
💡 Research Summary
The paper tackles a fundamental problem in evaluating large language models (LLMs) on mathematical reasoning tasks: traditional accuracy estimates based solely on ground‑truth correctness are extremely noisy when benchmark sizes are small and model outputs are stochastic. Even a single mis‑prediction can swing reported accuracy by several percentage points, leading to unstable model rankings. The authors observe that, especially on hard problems, an LLM often fails to produce the correct final answer but can reliably judge which of two candidate solutions is better. This “generation‑verification gap” suggests that pairwise comparison signals contain useful information beyond the binary correctness label.
To exploit this, the authors augment the standard evaluation dataset \(\{(x_i, y_i, g_i)\}_{i=1}^N\) with auxiliary information \(Z_i = (w_{1i}, w_{2i}, v_i)\). Here \(w_{1i}, w_{2i}\) are two reasoning chains generated by auxiliary LLMs, and \(v_i\) is a binary preference output from the target LLM indicating which chain it deems superior. Crucially, the conditional distribution \(p(z \mid x)\) of these auxiliary signals is fully known because the generation process is under experimental control; it can be approximated arbitrarily well by repeated Monte Carlo sampling.
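The augmentation step can be sketched as follows. This is a minimal illustration, not the paper's protocol: `sample_chain` and `prefers` are hypothetical stand-ins (here simple random stubs) for the auxiliary LLM generators and the target model's judge call, and the inner Monte Carlo loop mirrors how \(p(z \mid x)\) can be approximated by repeated sampling.

```python
import random

random.seed(0)

def sample_chain(x):
    # Hypothetical stand-in for an auxiliary LLM sampling a reasoning chain.
    return f"chain-{random.randrange(10**6)}-for-{x}"

def prefers(x, w1, w2):
    # Hypothetical stand-in for the target LLM's pairwise judgment:
    # True if it deems w1 the better chain. A coin flip here; in the
    # paper this is the model's comparison signal v.
    return random.random() < 0.5

def augment(dataset, m=200):
    """Attach Z_i = (w_{1i}, w_{2i}, v_i) to each labeled triple and
    approximate E[v | x] by repeated Monte Carlo sampling, which is
    possible because the generation process is under experimental
    control."""
    out = []
    for x, y, g in dataset:
        w1, w2 = sample_chain(x), sample_chain(x)
        v = int(prefers(x, w1, w2))
        mu_v_x = sum(
            int(prefers(x, sample_chain(x), sample_chain(x)))
            for _ in range(m)
        ) / m
        out.append((x, y, g, (w1, w2, v), mu_v_x))
    return out

aug = augment([("q1", "a1", 1), ("q2", "a2", 0)])
```

In a real evaluation the two chains would come from distinct auxiliary models and `prefers` would be a judging call to the target LLM; only the bookkeeping shown here carries over.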
The authors cast the problem in a semiparametric framework. The target parameter is \(\theta = \mathbb{E}[g]\), the model's expected correctness, which the one-step estimator recovers at the semiparametric efficiency bound by using the comparison signals as control variates.
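A minimal numerical sketch of the control-variate idea (not the paper's full EIF construction) is shown below. Assumed for illustration: correctness labels \(g\) and comparison signals \(v\) are correlated through a shared latent difficulty, and the mean of \(v\) is treated as known via a large Monte Carlo sample, since the auxiliary generation process is controlled. The adjusted estimator has visibly lower variance than the naive sample average.

```python
import numpy as np

rng = np.random.default_rng(0)

def cv_estimate(g, v, mu_v):
    """Control-variate estimate of theta = E[g]: correct the naive mean
    by the beta-scaled deviation of v from its known mean mu_v."""
    c = np.cov(g, v)
    beta = c[0, 1] / c[1, 1]          # estimated optimal coefficient
    return g.mean() - beta * (v.mean() - mu_v)

def draw(n, rng):
    u = rng.random(n)                              # shared latent difficulty
    g = (u < 0.4).astype(float)                    # correctness label
    v = (u + rng.normal(0, 0.3, n) < 0.6).astype(float)  # comparison signal
    return g, v

# "Known" mean of v, approximated by a large Monte Carlo sample.
mu_v = draw(1_000_000, rng)[1].mean()

n, trials = 50, 2000
naive = np.empty(trials)
cv = np.empty(trials)
for t in range(trials):
    g, v = draw(n, rng)
    naive[t] = g.mean()               # conventional accuracy estimate
    cv[t] = cv_estimate(g, v, mu_v)   # control-variate adjusted estimate
```

Comparing `np.var(naive)` with `np.var(cv)` over the 2000 replications shows the strict variance reduction the paper proves for its one-step estimator; the gain grows with the correlation between \(g\) and \(v\).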