PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review


Evaluating large language models typically relies on human-authored benchmarks, reference answers, and human or single-model judgments; these approaches scale poorly, become outdated quickly, and mismatch open-world deployments that depend on web retrieval and synthesis. We introduce PeerRank, a fully autonomous, end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses, and aggregate dense peer assessments into relative performance estimates, all without human supervision or gold references. PeerRank treats evaluation as a multi-agent process in which each model participates symmetrically as task designer, respondent, and evaluator, while known judgment biases are measured and controlled. In a large-scale study over 12 commercially available models and 420 autonomously generated questions, PeerRank produces stable, discriminative rankings and reveals measurable identity and presentation biases. Rankings are robust, and mean peer scores agree with Elo ratings. We further validate PeerRank on TruthfulQA and GSM8K, where peer scores correlate with objective accuracy. Together, these results suggest that bias-aware peer evaluation with selective web-grounded answering can scale open-world LLM assessment beyond static, human-curated benchmarks.


💡 Research Summary

The paper introduces PeerRank, a fully autonomous, multi‑agent framework for evaluating large language models (LLMs) without any human‑authored benchmarks, reference answers, or human judges. In PeerRank, each participating model simultaneously plays three symmetric roles: (1) it generates a set of evaluation questions drawn from five predefined categories (factual knowledge, reasoning/logic, current events, creative/open‑ended, practical how‑to); (2) it answers all questions, with live web retrieval enabled only for the "current events" category and the retrieved snippets injected as hidden context; and (3) it judges every other model's answer on a 1‑10 scale. Crucially, the judging phase is performed with web access disabled, ensuring that scores reflect the answer itself rather than additional evidence.
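The three-role loop above can be sketched as a minimal round of generation, answering, and judging. Everything below is an illustrative assumption, not the paper's implementation: the `Model` class, its method names, the category labels, and the placeholder answers and scores are all hypothetical stand-ins for real model API calls.

```python
# Hypothetical sketch of one PeerRank round: every model generates
# questions, answers every question, and judges every answer.
CATEGORIES = ["factual", "reasoning", "current_events", "creative", "practical"]


class Model:
    """Stand-in for an LLM; real calls would hit a model API."""

    def __init__(self, name):
        self.name = name

    def generate_questions(self, category, n):
        return [f"{self.name}/{category}/q{i}" for i in range(n)]

    def answer(self, question, web_context=None):
        # web_context carries retrieved snippets as hidden context
        return f"answer({question}, grounded={web_context is not None})"

    def judge(self, question, answer):
        return 7.0  # placeholder 1-10 score; web access is disabled here


def run_round(models, per_category=1):
    # Every model contributes questions in every category.
    questions = [(c, q) for m in models for c in CATEGORIES
                 for q in m.generate_questions(c, per_category)]
    scores = {}  # (judge_name, author_name) -> list of 1-10 scores
    for author in models:
        for category, q in questions:
            # Live retrieval is scoped to the current-events category only.
            ctx = "web snippets" if category == "current_events" else None
            ans = author.answer(q, web_context=ctx)
            for judge in models:
                scores.setdefault((judge.name, author.name), []).append(
                    judge.judge(q, ans))
    return scores
```

With two models and one question per category, each (judge, author) pair accumulates ten scores; the dense score matrix is what the aggregation step later consumes.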

To control systematic biases, PeerRank implements three evaluation regimes: (i) shuffle‑only (random answer order, identities visible), (ii) blind‑only (fixed order, identities hidden), and (iii) shuffle+blind (random order, identities hidden). The shuffle+blind condition serves as the baseline with minimal bias. The framework quantifies three bias dimensions: self‑bias (difference between a model’s self‑score and its baseline peer score), name bias (effect of revealing the model’s identity), and position bias (effect of answer order).
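Under these regimes, self‑bias and name bias reduce to simple differences of mean scores across conditions. The sketch below assumes each regime's results are stored as a `(judge, author) -> mean score` dictionary; the function names and the example numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
# Illustrative computation of self-bias and name bias from per-regime
# score dictionaries mapping (judge, author) -> mean 1-10 score.
from statistics import mean


def peer_score(scores, author):
    """Mean score from peers only; self-evaluations are excluded."""
    return mean(s for (j, a), s in scores.items() if a == author and j != author)


def self_bias(scores, model):
    """Self-assigned score minus the model's peer score in the same regime."""
    return scores[(model, model)] - peer_score(scores, model)


def name_bias(visible_scores, blind_scores, model):
    """Shift in peer score when the author's identity is revealed."""
    return peer_score(visible_scores, model) - peer_score(blind_scores, model)


# Hypothetical shuffle+blind baseline for two models A and B.
baseline = {("A", "A"): 9.0, ("A", "B"): 7.0, ("B", "A"): 8.0, ("B", "B"): 8.5}
print(self_bias(baseline, "A"))  # 9.0 (self) - 8.0 (peer) = 1.0
```

Position bias would be measured analogously, by comparing the fixed-order and shuffled-order regimes instead of the visible and blind ones.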

Two aggregation methods are used. The primary metric, the “peer score,” is the mean of all peer‑assigned scores excluding self‑evaluations. As a robustness check, the authors also convert scores into pairwise win‑loss outcomes and compute Elo ratings. The two rankings correlate strongly (Pearson r = 0.844, Spearman ρ = 0.755), indicating that the results are stable across aggregation schemes.
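The Elo robustness check can be illustrated by replaying per-question scores as pairwise matches. A minimal sketch, assuming each question yields a `model -> score` row; the K-factor of 32 and the initial rating of 1000 are conventional Elo choices, not values stated in the paper.

```python
# Sketch of converting per-question scores into pairwise win/loss/tie
# outcomes and sequentially updating Elo ratings.
def elo_ratings(score_rows, k=32, init=1000.0):
    """score_rows: list of dicts mapping model name -> score on one question."""
    ratings = {}
    for row in score_rows:
        models = list(row)
        for i, a in enumerate(models):
            for b in models[i + 1:]:
                ra = ratings.setdefault(a, init)
                rb = ratings.setdefault(b, init)
                # Expected score of a against b under the logistic Elo model.
                ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
                # Higher peer score wins the pairwise match; ties count 0.5.
                sa = 1.0 if row[a] > row[b] else 0.0 if row[a] < row[b] else 0.5
                ratings[a] = ra + k * (sa - ea)
                ratings[b] = rb + k * ((1.0 - sa) - (1.0 - ea))
    return ratings
```

Because Elo only consumes win/loss outcomes while the peer score averages raw magnitudes, strong agreement between the two rankings (as reported here) indicates the result is not an artifact of either aggregation scheme.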

The authors evaluate 12 commercially available LLMs (including variants of GPT‑5, Claude, Gemini, Grok, DeepSeek, Llama‑4, Sonar, Kimi, and Mistral) on a total of 420 autonomously generated questions (35 per model). The study finds that PeerRank yields discriminative and stable rankings, and that most models exhibit measurable self‑bias and position bias, with some also showing name bias.

External validation is performed on two established benchmarks with ground truth: TruthfulQA (264 multiple‑choice questions) and GSM8K (611 math problems). Under the same shuffle+blind judging protocol, PeerRank’s peer scores correlate positively with objective accuracy (≈0.71 for TruthfulQA and ≈0.68 for GSM8K), demonstrating that the relative judgments produced by the peer system align with absolute correctness.
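This validation step amounts to correlating two per-model vectors: mean peer scores and benchmark accuracies. A self-contained Pearson correlation over made-up numbers (not the paper's data) sketches the check:

```python
# Toy version of the external validation: correlate peer scores with
# ground-truth accuracy. The input vectors are hypothetical.
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


peer = [8.1, 7.4, 6.9, 6.2]      # hypothetical mean peer scores per model
acc = [0.92, 0.85, 0.80, 0.70]   # hypothetical benchmark accuracy per model
print(pearson(peer, acc))
```

A rank-based statistic such as Spearman's ρ (Pearson on the rank-transformed vectors) would guard the same check against outliers in either scale.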

The paper discusses several implications. First, by making the entire evaluation pipeline endogenous, PeerRank removes the need for costly human curation and enables continuous, up‑to‑date assessment that matches real‑world, web‑enabled deployments. Second, the explicit bias‑control protocols turn judge behavior into a first‑class measurement, addressing known issues of LLM‑as‑judge systems such as order effects and self‑preference. Third, the combination of web‑grounded answering and blind peer evaluation balances ecological validity (models can retrieve up‑to‑date information) with comparability (judges cannot exploit the retrieval step).

Limitations are acknowledged: the question generation process may inherit the models’ own topic preferences, potentially skewing difficulty distribution; web grounding is limited to the “current events” category, so the framework does not test retrieval across all domains; and the reliance on a single external retrieval provider per run may introduce provider‑specific biases. Future work could explore difficulty‑balancing generation, multi‑provider retrieval ensembles, and longitudinal updates to the autonomous benchmark.

In conclusion, PeerRank offers a novel, scalable approach to LLM evaluation that integrates live web grounding, multi‑model peer assessment, and systematic bias measurement. Its empirical results across a diverse set of commercial models and external benchmarks suggest that autonomous peer evaluation can produce reliable, bias‑aware rankings that correlate with objective performance, paving the way for sustainable, open‑world LLM assessment beyond static, human‑curated benchmarks.

