Beyond Retrieval: A Modular Benchmark for Academic Deep Research Agents

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv paper.

A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them is still an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the academic domains that are the core application for DR agents. To address these gaps, we introduce ADRA-Bank, a modular benchmark for Academic DR Agents. Grounded in academic literature, our benchmark is a human-annotated dataset of 200 instances across 10 academic domains, including both research and review papers. Furthermore, we propose a modular Evaluation Paradigm for Academic DR Agents (ADRA-Eval), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Results reveal uneven capabilities: while agents show specialized strengths, they struggle with multi-source retrieval and cross-field consistency. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, ADRA-Bank provides a diagnostic tool to guide the development of more reliable automatic academic research assistants.


💡 Research Summary

The paper addresses the pressing need for automated deep‑research (DR) agents capable of handling the ever‑growing volume of academic literature, and it highlights a critical gap in current evaluation practices: most benchmarks focus narrowly on information retrieval and ignore the higher‑level planning and reasoning capabilities that are essential for genuine scholarly work. To fill this gap, the authors introduce two major contributions.

First, they present ADRA‑Bank, a human‑annotated benchmark consisting of 200 instances drawn from ten distinct academic domains (materials, finance, chemistry, computer science, medicine, biology, environmental science, energy, building & construction, and earth science). Each instance includes a realistic research query (Q), a gold‑standard research plan (T*), a set of gold evidence citations (E+*), and a suite of diagnostic question‑answer pairs (D) designed to probe factual correctness and depth of reasoning. The source papers are all post‑2024, have at least ten citations, and were selected by senior Ph.D. candidates to ensure quality, recency, and disciplinary balance.
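The four-part structure of an instance described above can be sketched as a small data class. This is a minimal illustration, not the paper's actual schema; the field names and the default domain are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ADRAInstance:
    """One hypothetical ADRA-Bank instance (field names are illustrative)."""
    query: str                           # Q: realistic research query
    gold_plan: list[str]                 # T*: ordered gold-standard sub-tasks
    gold_evidence: set[str]              # E+*: gold citation DOIs
    diagnostics: list[tuple[str, str]]   # D: diagnostic (question, answer) pairs
    domain: str = "computer science"     # one of the ten academic domains
```

A loader for the benchmark would populate one such record per paper, with `gold_evidence` holding the DOI strings used later for exact-match scoring.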

Second, the authors propose ADRA‑Eval, a modular‑integrated evaluation paradigm that decomposes any DR system into three core modules: a planner (π) that converts Q into an ordered sub‑task sequence T, a retriever (ρ) that, under a budget B, gathers evidence E+ from a corpus, and a reasoner (σ) that synthesizes the evidence into a comprehensive report R. For each module they define dedicated metric suites—Mπ (coverage, structural correctness, redundancy of sub‑tasks), Mρ (relevance, coverage, provenance of citations using exact DOI matching), and Mσ (accuracy, consistency, depth, breadth measured against the diagnostic set D).
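The provenance component of Mρ rests on exact DOI matching between retrieved and gold citations. A minimal sketch of such a scorer, assuming precision/recall-style aggregation (the paper's exact formulas may differ):

```python
def provenance_scores(retrieved: set[str], gold: set[str]) -> dict[str, float]:
    """Score citation provenance via exact DOI matching (illustrative sketch).

    precision: fraction of retrieved DOIs present in the gold set E+*
    recall:    fraction of gold DOIs the retriever recovered (coverage)
    """
    if not retrieved or not gold:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    hits = len(retrieved & gold)           # exact string-match intersection
    p = hits / len(retrieved)
    r = hits / len(gold)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

As the paper's limitations section notes, a rule-based matcher like this penalizes novel but correct evidence that is absent from the gold set.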

ADRA‑Eval operates in two complementary modes. In the end‑to‑end mode, the modules are chained together, allowing the measurement of a system’s overall performance and the propagation of upstream errors. In the isolated mode, each module receives the gold‑standard inputs (T* and/or E+*) so that its intrinsic capability can be assessed independently of upstream mistakes. This dual‑mode design enables a clear attribution of performance bottlenecks to either the underlying large language model (LLM) backbone or to the system‑level orchestration.
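The two modes can be expressed as function composition over the three module signatures. The type aliases and function names below are assumptions for illustration, not the paper's API:

```python
from typing import Callable

# Illustrative module signatures for the three ADRA-Eval modules.
Planner = Callable[[str], list[str]]          # pi: Q -> sub-task sequence T
Retriever = Callable[[list[str]], set[str]]   # rho: T -> evidence DOIs E+
Reasoner = Callable[[str, set[str]], str]     # sigma: (Q, E+) -> report R

def end_to_end(pi: Planner, rho: Retriever, sigma: Reasoner, q: str) -> str:
    """Chained mode: modules run in sequence, so upstream errors propagate."""
    return sigma(q, rho(pi(q)))

def isolated_reasoner(sigma: Reasoner, q: str, gold_evidence: set[str]) -> str:
    """Isolated mode: the reasoner receives the gold evidence E+*, so its
    intrinsic capability is measured independently of planning/retrieval."""
    return sigma(q, gold_evidence)
```

Comparing a module's isolated score with its end-to-end score is what lets the framework attribute a bottleneck to the LLM backbone versus the orchestration around it.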

The authors also incorporate efficiency metrics: processing latency, token count of the final report, and monetary cost, and they analyze the resulting quality–efficiency trade-offs via Pareto-frontier methods.
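A Pareto-frontier analysis over two cost metrics (say, latency and monetary cost, both to be minimized) can be sketched as follows; this is a generic implementation of the standard technique, not code from the paper:

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the Pareto-optimal subset when both coordinates are costs
    to minimize, e.g. (latency_seconds, dollar_cost) per system."""
    frontier = []
    for p in points:
        # p is dominated if some other point is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier
```

Systems on the frontier represent the best available trade-offs; everything else is strictly worse than some alternative on both axes.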

Empirical evaluation covers several state‑of‑the‑art commercial DR agents (e.g., OpenAI’s ChatGPT, Google Gemini) and a variety of foundational LLMs (GPT‑4o, Gemini, Llama‑2, etc.). The results reveal a fragmented capability landscape. While some agents excel at retrieval (e.g., Grok) and others at reasoning (e.g., Gemini), all struggle with multi‑source retrieval—especially from review papers—and with maintaining cross‑field consistency. A key finding is that improvements in high‑level planning dramatically boost reasoning performance; a well‑structured plan reduces unnecessary searches and mitigates error propagation, thereby allowing the reasoner to generate more factually accurate, coherent reports.

The paper acknowledges limitations. The dataset size (200 instances) may not fully capture the diversity of real‑world scholarly tasks, and reliance on DOI‑based ground truth limits applicability to non‑traditional evidence such as preprints, patents, datasets, or code repositories. Rule‑based metrics may penalize novel but correct evidence not present in the gold set. Nonetheless, the modular‑isolated evaluation combined with end‑to‑end testing offers a powerful diagnostic tool for developers, pinpointing exactly where a DR system fails.

In conclusion, ADRA‑Bank and the ADRA‑Eval framework constitute a significant step toward standardized, fine‑grained assessment of academic DR agents. By exposing actionable failure modes—particularly in planning and multi‑source retrieval—the work provides clear research directions. Future extensions could enlarge the benchmark, incorporate non‑citation evidence, and move toward learned evaluation metrics, potentially establishing ADRA‑Bank as the de‑facto standard for trustworthy AI‑driven scholarly assistants.

