ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
Automated peer review has evolved from simple text classification to structured feedback generation. However, current state-of-the-art systems still struggle with “surface-level” critiques: they excel at summarizing content but often fail to accurately assess novelty and significance or to identify deep methodological flaws, because they evaluate papers in a vacuum, lacking the external context a human expert possesses. In this paper, we introduce ScholarPeer, a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. ScholarPeer employs a dual-stream process of context acquisition and active verification: it dynamically constructs a domain narrative using a historian agent, identifies missing comparisons via a baseline scout, and verifies claims through a multi-aspect Q&A engine, grounding the critique in live web-scale literature. Evaluated on DeepReview-13K, ScholarPeer achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations and narrows the gap to human-level review diversity.
💡 Research Summary
The paper addresses the growing bottleneck in scholarly peer review caused by the massive influx of submissions to major AI conferences. Existing automated review systems, which rely on large language models (LLMs) fine-tuned on static paper-review pairs or trained with reinforcement learning, are limited to surface-level summarization. They lack the ability to place a manuscript in the broader, ever-evolving research landscape, resulting in “vacuum” evaluations that miss novelty, significance, and methodological flaws.
To overcome these limitations, the authors propose ScholarPeer, a search‑enabled multi‑agent framework that mimics the cognitive workflow of a senior researcher. ScholarPeer decomposes the review task into two complementary streams: (1) Knowledge Acquisition & Contextualization and (2) Active Verification. Each stream is handled by specialized agents that interact through structured JSON messages.
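The paper does not publish a message schema, but the structured JSON traffic between agents could be sketched as below; the field names (`sender`, `recipient`, `msg_type`, `payload`) are illustrative assumptions, not the authors' actual format.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class AgentMessage:
    """Illustrative envelope for inter-agent JSON messages (hypothetical schema)."""
    sender: str      # e.g. "summary_agent"
    recipient: str   # e.g. "qa_engine"
    msg_type: str    # e.g. "summary" | "narrative" | "baselines" | "qa_log"
    payload: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize the dataclass to a JSON string for transport between agents
        return json.dumps(asdict(self))

msg = AgentMessage("summary_agent", "qa_engine", "summary", {"claims": ["c1"]})
decoded = json.loads(msg.to_json())
```

A typed envelope like this keeps each agent's output machine-checkable before the next agent consumes it.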
Knowledge Acquisition & Contextualization consists of four agents:
- Summary Agent – compresses the full manuscript into a structured representation Ŝ containing core claims, methods, and evidence, reducing internal token load for downstream agents.
- Literature Review & Expansion Agent – identifies the paper’s sub‑domain, performs an initial web‑scale search via a Google‑search‑enabled LLM, and iteratively expands the search to capture recent pre‑prints, blog posts, GitHub repos, and other non‑standard sources. This ensures the system has up‑to‑date external knowledge.
- Sub‑Domain Historian Agent – organizes retrieved documents into a chronological “domain narrative,” allowing the system to assess the trajectory of ideas and judge whether the submission represents an incremental improvement or a paradigm shift.
- Baseline Scout Agent – parses the experimental setup, then independently searches for state‑of‑the‑art baselines and datasets relevant to the task, returning a list of omitted comparisons. This mimics the human reviewer’s habit of pointing out missing baselines.
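The acquisition stream above can be sketched as a small pipeline. The functions below are hypothetical stubs standing in for the search-enabled LLM calls the paper describes; only the data flow between the four agents is from the source.

```python
def summarize(manuscript: str) -> dict:
    """Summary Agent: compress the manuscript into a structured summary Ŝ (stub)."""
    return {"claims": [manuscript[:40]], "methods": [], "baselines": []}

def expand_literature(subdomain: str) -> list:
    """Literature Review & Expansion Agent: iterative web-scale retrieval (stub)."""
    return [{"title": f"Recent advances in {subdomain}", "year": 2025},
            {"title": f"A survey of {subdomain}", "year": 2023}]

def build_narrative(docs: list) -> list:
    """Sub-Domain Historian Agent: order retrieved work chronologically."""
    return sorted(docs, key=lambda d: d["year"])

def scout_baselines(s_hat: dict, docs: list) -> list:
    """Baseline Scout Agent: flag relevant work the paper never compares against."""
    compared = set(s_hat["baselines"])
    return [d["title"] for d in docs if d["title"] not in compared]

def acquire_context(manuscript: str, subdomain: str):
    # Data flow: manuscript -> Ŝ -> retrieved docs -> narrative + missing baselines
    s_hat = summarize(manuscript)
    docs = expand_literature(subdomain)
    return s_hat, build_narrative(docs), scout_baselines(s_hat, docs)
```

In the real system each stub would be an LLM invocation, and the summary/expansion steps could run in parallel.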
Active Verification is performed by the Multi‑Aspect Q&A Engine, which adopts a skeptical persona. Using the outputs of the acquisition agents, it generates probing questions (Q_probe) targeting potential weaknesses in novelty, technical soundness, and evidence. For each question, the engine (i) self‑answers based on Ŝ, (ii) cross‑checks the answer against the domain narrative and baseline list, and (iii) logs any discrepancy. The resulting interrogation log serves as concrete, evidence‑backed criticism rather than generic commentary.
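Steps (i)–(iii) could be sketched as the loop below. The cross-check heuristic is a toy stand-in for illustration; the actual engine delegates that judgment to an LLM.

```python
def interrogate(s_hat: dict, narrative: list, missing: list, probes: list) -> list:
    """Skeptical Q&A pass: self-answer each probe from Ŝ, cross-check it
    against the domain narrative and baseline list, log discrepancies."""
    log = []
    for probe in probes:
        # (i) self-answer from the structured summary Ŝ
        answer = s_hat.get(probe["aspect"], "not addressed")
        # (ii) cross-check: here, a toy rule flags evidence claims whenever
        # the Baseline Scout reported omitted comparisons (the real system
        # also checks the chronological narrative)
        suspect = probe["aspect"] == "evidence" and bool(missing)
        # (iii) record only concrete, evidence-backed discrepancies
        if suspect or answer == "not addressed":
            log.append({"question": probe["text"], "answer": answer})
    return log
```

The returned log is exactly the “interrogation log” the generator later conditions on.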
The Review Generator Agent finally synthesizes a full review by conditioning on (a) the structured summary, (b) the verified Q&A pairs, and (c) explicit review guidelines (e.g., ICLR checklist, NeurIPS criteria). By swapping the guideline prompt, ScholarPeer can produce venue‑specific reviews without retraining.
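The guideline-swapping idea amounts to prompt templating. The registry below is hypothetical (the guideline texts are paraphrased placeholders, not the venues' actual checklists), but it shows how a venue switch needs no retraining.

```python
# Hypothetical guideline registry; texts are paraphrased placeholders.
GUIDELINES = {
    "iclr": "Score soundness, presentation, and contribution (ICLR checklist).",
    "neurips": "Rate quality, clarity, significance, originality (NeurIPS criteria).",
}

def build_review_prompt(s_hat: dict, qa_log: list, venue: str = "iclr") -> str:
    """Condition the Review Generator on (a) the structured summary,
    (b) the verified Q&A pairs, and (c) venue-specific guidelines."""
    return "\n".join([
        f"Guidelines: {GUIDELINES[venue]}",
        f"Structured summary: {s_hat}",
        f"Verified findings: {qa_log}",
        "Write a full peer review grounded in the findings above.",
    ])
```

Retargeting the review to another venue is then a one-argument change.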
Experimental Setup: The authors evaluate on the DeepReview‑13K test split (1,286 papers from ICLR 2024‑2025). Baselines are divided into (1) fine‑tuned models (CycleReviewer 8B/70B, DeepReviewer 7B/14B) and (2) agentic systems (AgentReview, AI Scientist v2) built on Gemini 3 Flash, Gemini 3 Pro, and Claude Sonnet 4.5. ScholarPeer uses Gemini 3 Pro as the core LLM for all agents, while the literature expansion and historian agents employ a Google‑search‑enabled LLM that can parse non‑standard academic sources.
Metrics: Win‑rate against each baseline, H‑Max score (calibrated against human reviews, with human upper bound 5.0), and Spearman correlation with human ranking.
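For reference, Spearman correlation is the Pearson correlation of ranks, and win-rate is the fraction of side-by-side comparisons won. A minimal, tie-unaware version (the paper's exact scoring code is not given) is:

```python
def spearman(x: list, y: list) -> float:
    """Spearman rank correlation; ties are broken arbitrarily, so this
    is exact only for tie-free score lists."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n - 1) / 2  # ranks 0..n-1 have mean (n-1)/2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

def win_rate(wins: int, total: int) -> float:
    """Fraction of pairwise comparisons the system wins."""
    return wins / total
```

For production use, `scipy.stats.spearmanr` handles ties properly via fractional ranks.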
Results: ScholarPeer outperforms all baselines across every metric. Win‑rates exceed 97 % for most categories, with especially high performance in Significance Assessment and Constructive Value (≈98 %). The average H‑Max score reaches 4.41, approaching human expert levels, and Spearman correlation peaks at 0.42, surpassing the best prior agentic system (≈0.31). Table 2 shows dominant superiority in Technical Accuracy, Analytical Depth, and Overall Judgment.
Implementation Details: The system avoids static APIs (e.g., Semantic Scholar) and directly issues LLM‑driven web queries, enabling retrieval of the latest, possibly non‑indexed material. All agents run in parallel where possible, and the Q&A engine operates iteratively until convergence or a preset query budget is exhausted.
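The convergence-or-budget loop, plus the caching the authors recommend for latency and cost, could look like the sketch below; the orchestration names are assumptions.

```python
import functools

@functools.lru_cache(maxsize=1024)
def web_query(query: str) -> tuple:
    # Stand-in for an LLM-driven web search; lru_cache illustrates the
    # result caching suggested to curb latency and cost.
    return (f"results for: {query}",)

def run_qa_engine(generate_probes, verify, budget: int = 20) -> list:
    """Iterate until convergence (no new probes) or the query budget is spent."""
    used, log = 0, []
    while used < budget:
        probes = generate_probes(log)
        if not probes:  # convergence: the engine raises no new questions
            break
        for q in probes:
            if used >= budget:
                break
            log.append(verify(q))
            used += 1
    return log
```

Passing the running log back into `generate_probes` lets the engine decide whether earlier answers raise follow-up questions.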
Limitations: (1) Real‑time web search introduces latency and cost; caching and asynchronous pipelines are needed for production deployment. (2) The quality of retrieved documents is not fully vetted, raising the risk of propagating misinformation. (3) Current adaptability is limited to swapping guideline prompts; broader multi‑venue formatting requires further engineering.
Future Work: The authors suggest integrating evidence‑weighting mechanisms, building a citation‑graph‑based confidence model, and extending the framework to support a wider range of conference/journal formats.
Conclusion: ScholarPeer demonstrates that a context‑aware, multi‑agent architecture can bridge the gap between static LLM reviewers and human experts. By actively acquiring up‑to‑date literature, constructing a domain narrative, auditing missing baselines, and interrogating claims through a structured Q&A process, the system delivers deep, diverse, and human‑like reviews. The empirical gains on DeepReview‑13K indicate a substantial step toward scalable, high‑quality automated peer review, with promising avenues for further refinement and broader adoption.