AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the initial outline. However, constructing a comprehensive outline itself demands strong reasoning ability, causing current deep research systems to rely almost exclusively on closed-source or online large models. This reliance raises practical barriers to deployment and introduces safety and privacy concerns for user-authored data. In this work, we present AgentCPM-Report, a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process and an 8B-parameter deep research agent. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines during report generation. Under this policy, the agent alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, jointly supporting information acquisition, knowledge refinement, and iterative outline evolution. To effectively equip small models with this capability, we introduce a Multi-Stage Agentic Training strategy, consisting of cold-start, atomic skill RL, and holistic pipeline RL. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems, with substantial gains in Insight.


💡 Research Summary

AgentCPM‑Report introduces a novel framework for generating open‑ended deep‑research reports using a lightweight 8‑billion‑parameter local model. The authors identify two major shortcomings of existing approaches: (1) retrieval‑then‑write pipelines often lose coherence over long horizons, and (2) plan‑then‑write pipelines rely on a high‑quality, comprehensive outline generated before any drafting begins. The latter creates an “insight ceiling” because the initial outline must already contain all necessary information, a requirement that small models struggle to meet. Consequently, most current deep‑research systems depend on closed‑source or online large models, raising deployment costs and privacy concerns and limiting on‑device use.

To overcome these limitations, the paper proposes the Writing As Reasoning Policy (WARP), a dynamic policy that treats research as a sequential decision‑making process. At each step of the interaction loop, the agent observes a global state consisting of the user query, a mutable outline, the current draft, and retrieved context. The action space includes INITIALIZE, SEARCH, WRITE, EXPAND, and TERMINATE. WARP alternates between two macro‑states:

  1. Evidence‑Based Drafting – The agent generates section‑specific search queries conditioned on the user query, the current outline entry, and the draft built so far. Retrieved documents are then used to ground the WRITE operation, ensuring factual consistency and a smooth integration of evidence.

  2. Reasoning‑Driven Deepening – After a drafting pass, the agent evaluates the draft for logical gaps or insufficient depth. If a section is identified as shallow, the agent decides to EXPAND that section, creating finer‑grained sub‑sections and updating the outline. The process repeats until the agent’s TERMINATE decision deems the draft sufficiently dense and coherent.
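The two macro‑states above can be sketched as a minimal decision loop. This is an illustrative reconstruction, not the paper's implementation: the `State` fields mirror the global state described earlier, and `is_shallow` is a stand‑in heuristic for what is, in the actual system, a learned judgment.

```python
from dataclasses import dataclass, field

# Action vocabulary from the paper's description.
INITIALIZE, SEARCH, WRITE, EXPAND, TERMINATE = range(5)

@dataclass
class State:
    query: str                                    # user query
    outline: list                                 # mutable outline entries
    draft: dict = field(default_factory=dict)     # section -> drafted text
    context: list = field(default_factory=list)   # retrieved documents

def is_shallow(text, min_chars=400):
    # Stand-in heuristic; the real agent makes a learned depth judgment.
    return len(text) < min_chars

def warp_step(state, section):
    """One Evidence-Based Drafting pass, followed by a Deepening check."""
    actions = [SEARCH,   # section-specific queries from query+outline+draft
               WRITE]    # ground the section text in the retrieved context
    if is_shallow(state.draft.get(section, "")):
        actions.append(EXPAND)  # split a shallow section into sub-sections
    return actions
```

In this sketch, a single pass always retrieves and writes; EXPAND is appended only when the deepening check flags the section, mirroring the alternation described above.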

Because WARP is a learned policy rather than a fixed heuristic, it faces long‑horizon credit assignment and a vastly expanded action space. To train a small model effectively, the authors design a Multi‑Stage Agentic Training pipeline:

  • Cold‑Start (Supervised Fine‑Tuning) – Establishes basic instruction following and format compliance.
  • Atomic Skill Reinforcement Learning – Decomposes the overall objective into atomic abilities (initialization, search, writing, expansion, termination). Separate reward functions are defined for each skill, combining basic property checks, holistic quality metrics, and faithfulness assessments (see Table 1). This stage stabilizes low‑level behaviors before tackling the full task.
  • Holistic Pipeline Reinforcement Learning – Optimizes end‑to‑end report quality, using high‑level metrics such as comprehensiveness, insight, and faithfulness. Rewards are propagated back through the entire trajectory, encouraging the agent to trigger deepening only when it yields a measurable informational gain.
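The per‑skill reward design in the atomic stage can be illustrated with a small composite function. The gating structure (basic property checks that zero out the reward, then a weighted blend of holistic quality and faithfulness) follows the description above, but the specific weights and the function names are assumptions for illustration only.

```python
def atomic_skill_reward(output, checks, quality, faithfulness,
                        w_quality=0.6, w_faith=0.4):
    """Composite reward for one atomic skill (illustrative weights).

    Basic property checks act as a gate: if any fails, the reward is 0.
    Otherwise the reward blends a holistic quality score and a
    faithfulness score, each assumed to lie in [0, 1].
    """
    if not all(check(output) for check in checks):
        return 0.0
    return w_quality * quality(output) + w_faith * faithfulness(output)
```

Gating on format/property checks before scoring is a common way to stabilize early RL behavior, which matches the stated purpose of this stage.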

A key innovation for handling the ambiguous stopping problem is Trajectory Pruning. Teacher trajectories are deliberately over‑expanded; the optimal stopping point is identified post‑hoc as the draft with the highest holistic score, and the trajectory is truncated at that point with a TERMINATE label. This provides a clear supervision signal for when the agent should cease further expansion.
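Trajectory Pruning as described above can be sketched in a few lines: score every intermediate draft of a deliberately over‑expanded teacher trajectory, truncate at the best‑scoring step, and append a TERMINATE label. Here `holistic_score` is a stand‑in for the paper's holistic judge; the data layout is an assumption.

```python
def prune_trajectory(trajectory, holistic_score):
    """Truncate an over-expanded teacher trajectory at its best draft.

    trajectory: list of (action, draft) pairs, deliberately over-expanded.
    Returns the prefix ending at the highest-scoring draft, with a
    ("TERMINATE", best_draft) step appended as the stopping supervision.
    """
    scores = [holistic_score(draft) for _, draft in trajectory]
    best = max(range(len(scores)), key=scores.__getitem__)
    return trajectory[:best + 1] + [("TERMINATE", trajectory[best][1])]
```

Because the score of later, over‑expanded drafts can drop below the peak, the truncation point gives the model a concrete signal for when further expansion stops paying off.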

Implementation uses MiniCPM‑4.1‑8B as the backbone. The system caps outline depth at three levels and limits deepening steps to twelve, balancing performance and efficiency. Experiments are conducted on three benchmarks:

  • DeepResearch Bench – 100 PhD‑level scientific tasks.
  • DeepConsult – 102 business and financial analysis queries.
  • DeepResearch Gym – 100 general information‑seeking tasks.
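The efficiency caps mentioned above (outline depth at most three levels, at most twelve deepening steps) amount to a simple budget check before any EXPAND action. The constant and function names below are illustrative assumptions, not the paper's code.

```python
# Reported caps from the implementation description.
MAX_OUTLINE_DEPTH = 3
MAX_DEEPEN_STEPS = 12

def may_expand(section_depth, deepen_steps_used):
    """Allow an EXPAND action only while both budgets have headroom."""
    return (section_depth < MAX_OUTLINE_DEPTH
            and deepen_steps_used < MAX_DEEPEN_STEPS)
```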

Across all benchmarks, AgentCPM‑Report outperforms leading closed‑source systems (e.g., Gemini‑2.5‑Pro) on the Insight metric, achieving average improvements of 12–18 %. It also matches or exceeds these baselines on faithfulness and comprehensiveness, despite using a model an order of magnitude smaller. Qualitative analysis shows the agent can discover novel connections and refine arguments during the deepening phase, demonstrating genuine reasoning beyond mere template filling.

The paper’s contributions are threefold: (1) a dynamic, policy‑driven framework that interleaves drafting and reasoning, eliminating the static planning bottleneck; (2) a curriculum‑based reinforcement learning strategy that resolves long‑horizon credit assignment for small models; and (3) empirical evidence that an 8 B local model can achieve deep‑research performance comparable to proprietary large models while preserving data privacy and enabling on‑device deployment.

Limitations include increased computational overhead due to repeated search‑write‑deepening loops and the current restriction to three‑level outlines and a fixed maximum number of deepening steps, which may constrain very complex topics. Future work could explore multimodal retrieval, knowledge‑graph integration, and more scalable RL techniques to further enhance depth and breadth.

In summary, AgentCPM‑Report demonstrates that with a carefully designed dynamic policy and staged training, small open‑source language models can perform sophisticated, insight‑rich research reporting without reliance on external large models, opening the door for privacy‑preserving, cost‑effective AI‑assisted research tools.

