DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
💡 Research Summary
The paper “DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality” addresses the critical challenge of verifying the factuality of claims within Deep Research Reports (DRRs) generated by search-augmented LLM agents. DRRs are complex, expert-level syntheses that require multi-document reasoning, making their verification far more difficult than checking simple factoid claims. The authors identify a dual problem: existing automated fact-checkers are designed for general-domain, atomic claims and do not transfer well to DRRs, and no reliable benchmark exists to evaluate such verifiers in the DRR domain.
The core insight of the paper is that the traditional paradigm of static, expert-labeled benchmarks is fundamentally flawed for this high-cognitive-load task. Through a controlled study with PhD-level domain specialists, the authors demonstrate that even unassisted experts achieve only 60.8% accuracy on a hidden set of verifiable claims (the “micro-gold” set) within their own specialties. This exposes the brittleness of treating one-shot expert annotations as an infallible gold standard for DRR factuality.
To overcome this, the authors propose a novel paradigm called Evolving Benchmarking, implemented via the Audit-then-Score (AtS) protocol. AtS treats benchmark truth not as a fixed snapshot but as an evolving consensus that co-evolves with improving verification models. The protocol works in a loop:
- Evaluate: A challenger model makes predictions on the current benchmark consensus.
- Challenge: If the model’s verdict disagrees with the benchmark label, it must submit a formal proposal with supporting evidence.
- Audit: A human expert or a trusted agent auditor adjudicates the dispute. If the challenger’s evidence and rationale are superior to the existing benchmark rationale, the update is accepted.
- Evolve & Score: The benchmark consensus is updated with the accepted revisions. All models are then scored against this refined, more accurate benchmark.
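The four steps above can be sketched as a single round of the protocol. This is a minimal illustration, not the paper's implementation: the `Claim`, `challenger`, and `auditor` interfaces are hypothetical stand-ins for the benchmark entries, verification model, and human or agent auditor described in the text.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    label: str       # current benchmark verdict, e.g. "supported"
    rationale: str   # auditable rationale backing the label

@dataclass
class Proposal:
    claim: Claim
    new_label: str
    evidence: str

def ats_round(benchmark, challenger, auditor):
    """One Audit-then-Score round (hypothetical interfaces).

    `challenger(claim)` returns (predicted_label, evidence);
    `auditor(proposal)` returns True when the challenger's evidence
    is judged superior to the benchmark's existing rationale.
    """
    # Evaluate + Challenge: collect formal disputes wherever the
    # model's verdict disagrees with the current benchmark label.
    proposals = []
    for claim in benchmark:
        pred, evidence = challenger(claim)
        if pred != claim.label:
            proposals.append(Proposal(claim, pred, evidence))

    # Audit + Evolve: accepted revisions update both the label
    # and the rationale, so the benchmark stays auditable.
    for p in proposals:
        if auditor(p):
            p.claim.label = p.new_label
            p.claim.rationale = p.evidence

    # Score: models are measured against the refined consensus.
    correct = sum(challenger(c)[0] == c.label for c in benchmark)
    return correct / len(benchmark)
```

Because accepted revisions mutate the benchmark before scoring, a challenger whose disputed verdicts survive the audit is credited for them, which is what lets labels and models co-evolve across rounds.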
The paper validates AtS in two key ways. First, it shows that humans are significantly more reliable as auditors than as one-shot labelers. Across four rounds of simulated AtS, expert accuracy on the micro-gold set rose from 60.8% to 90.9%. Second, it explores replacing the human auditor with a capable LLM agent, showing promising results and pointing toward a potential autonomous, self-improving evaluation ecosystem.
Building upon this methodological foundation, the authors introduce two concrete artifacts:
- DeepFact-Bench: A versioned DRR factuality benchmark constructed through multiple rounds of AtS. It includes source reports, current labels, and auditable rationales for each claim, enabling continuous refinement even after release.
- DeepFact-Eval: An advanced, multi-step verification agent designed for document-level fact-checking of DRRs. It comes in two variants: a strong, expert-level version and a “Grouped Lite” version that groups related claims for efficient verification, offering substantial cost and speed savings with minimal accuracy loss.
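The “Grouped Lite” idea of batching related claims into one verification call can be sketched as follows. The grouping key and batch size here are illustrative assumptions; the paper's actual grouping criterion may differ.

```python
from collections import defaultdict

def group_claims(claims, key_fn, max_group=5):
    """Batch related claims so a single verification call can check
    several at once, trading a small amount of per-claim focus for
    large cost and latency savings (hypothetical sketch).

    `key_fn(claim)` returns the grouping key, e.g. the source
    document or report section a claim was extracted from.
    """
    buckets = defaultdict(list)
    for claim in claims:
        buckets[key_fn(claim)].append(claim)

    # Split each bucket into chunks of at most `max_group` claims,
    # so one verification prompt never grows unboundedly.
    groups = []
    for bucket in buckets.values():
        for i in range(0, len(bucket), max_group):
            groups.append(bucket[i:i + max_group])
    return groups
```

With grouping, the number of verifier calls scales with the number of groups rather than the number of claims, which is where the reported cost and speed savings would come from.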
Empirical evaluations show that DeepFact-Eval significantly outperforms existing state-of-the-art fact-checking pipelines (e.g., by +27.5 accuracy points over SAFE) and repurposed deep-research verifiers on DeepFact-Bench. Furthermore, DeepFact-Eval demonstrates strong transfer performance on external factuality datasets, with remaining discrepancies often attributable to label noise in those static benchmarks, thereby underscoring the value of auditable, evolving benchmarks like DeepFact-Bench.
In summary, this paper makes a significant contribution by reframing the problem of evaluating DRR factuality. It moves from a static “gold standard” model to a dynamic, collaborative, and evolutionary process that mirrors scientific discourse itself. The proposed AtS protocol, along with the DeepFact-Bench and DeepFact-Eval artifacts, provides a robust framework for developing and assessing more reliable verification tools for AI-generated deep research.