Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Training and evaluating DeepResearch-generated reports remains challenging due to the lack of verifiable reward signals, so rubric-based evaluation has become common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, then train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We show empirically that our rubric generators deliver supervision that is more discriminative and better aligned with human judgment than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to that of leading closed-source models.


💡 Research Summary

The paper tackles the fundamental challenge of training and evaluating deep‑research (long‑form report) generation systems, where verifiable reward signals are scarce and human evaluation is costly. Existing solutions either rely on coarse, pre‑defined rubrics that lack granularity or on manually crafted query‑specific rubrics that do not scale. To bridge this gap, the authors propose a pipeline that learns query‑specific rubric generators directly from human preference data and uses these rubrics as fine‑grained reward models in reinforcement‑learning (RL) training of deep‑research agents.

First, they construct a large preference dataset comprising over 5,000 research‑style queries. Each query is paired with two candidate reports generated by strong LLMs (e.g., DeepSeek V3.1, Tongyi‑DeepResearch). Human annotators label which report they prefer based on usefulness, coherence, completeness, and alignment with the query. This dataset provides the ground‑truth signal for learning rubric generators.
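The shape of one such preference example can be sketched as follows. The field names and schema here are illustrative assumptions; the paper's actual data format is not specified in this summary.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One human-labeled comparison from the rubric-training dataset.

    Field names are illustrative assumptions, not the paper's schema.
    """
    query: str       # DeepResearch-style research question
    report_a: str    # candidate report from one strong LLM
    report_b: str    # candidate report from another strong LLM
    preferred: str   # "a" or "b": the annotator's choice

# Hypothetical example record
example = PreferencePair(
    query="Survey recent advances in rubric-based evaluation of LLM-generated reports.",
    report_a="...full report text...",
    report_b="...full report text...",
    preferred="a",
)
```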

The rubric generator is trained with Group Relative Policy Optimization (GRPO), a policy‑gradient method that optimizes relative rankings. The reward function is hybrid: (1) a Preference Consistency reward that encourages the generated rubric to assign higher scores to the human‑preferred report and lower scores to the rejected one, and (2) an LLM‑as‑Judge reward that evaluates the rubric itself for clarity, applicability, and discriminativeness using a powerful pre‑trained LLM. A small format‑validity term ensures the output is well‑structured JSON. By combining these signals, the generator learns to produce rubrics that are both human‑aligned and logically sound.

To handle the long‑horizon reasoning required for report synthesis, the authors introduce the Multi‑agent Markov‑state (MaMs) workflow. MaMs decomposes the task into three cooperating agents: a search agent that calls external tools and retrieves evidence, a state‑update module that maintains a Markovian representation of the current evidence and plan, and a report‑generation agent that writes structured sections (introduction, methods, results, discussion). The agents interact in a Markov‑state loop, allowing the system to manage long contexts and multi‑step reasoning more robustly than traditional ReAct pipelines. Crucially, after each rollout the query‑specific rubric generated by the learned generator is applied to score the draft report, providing immediate, fine‑grained RL feedback.
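The MaMs loop described above can be sketched as follows, assuming simple function-valued agents; the actual agent interfaces and stopping criteria are not specified in this summary.

```python
def mams_rollout(query, search_agent, update_state, report_agent,
                 score_with_rubric, max_steps=5):
    """Run search -> state-update cycles, then draft and score a report.

    The Markov property is reflected in update_state: the next state
    depends only on the current state and newly retrieved evidence.
    """
    state = {"query": query, "evidence": [], "plan": None}
    for _ in range(max_steps):
        evidence = search_agent(state)         # tool calls / retrieval
        state = update_state(state, evidence)  # compact Markovian state
        if state.get("done"):                  # assumed stopping signal
            break
    report = report_agent(state)               # write structured sections
    reward = score_with_rubric(query, report)  # query-specific rubric score
    return report, reward
```

A dummy instantiation with stub agents shows the control flow: the search agent appends evidence until the state-update module marks the state done, after which the report agent drafts from the final state and the learned rubric scores it for RL feedback.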

Empirical evaluation proceeds along two dimensions. First, rubric quality is measured against three baselines: (a) generic pre‑defined rubrics, (b) expert‑crafted query‑specific rubrics, and (c) LLM‑generated rubrics without human‑preference grounding. The learned rubrics achieve a 12 % increase in preference‑consistency score and a 9 % boost in LLM‑as‑Judge quality metrics over the best baseline. Second, the rubrics are integrated into the training of deep‑research agents on the DeepResearch Bench. When combined with the MaMs workflow, the agents consistently outperform all open‑source baselines (e.g., WebWeaver, DrTulu) and approach the performance of leading closed‑source models such as GPT‑4o and Claude‑3, often within a 1–2 % margin. Notably, on complex domains like law, medicine, and business, the correlation between model scores and human judgments rises to 0.78, indicating strong alignment.
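The preference-consistency score used in this comparison can be read as pairwise accuracy: the fraction of labeled pairs where the rubric scores the human-preferred report higher. A minimal sketch, assuming ties count as inconsistent (the paper's tie handling is unknown):

```python
def preference_consistency_score(pairs):
    """pairs: iterable of (score_preferred, score_rejected) tuples.

    Returns the fraction of pairs where the rubric's ranking agrees
    with the human label. Ties are treated as disagreements here.
    """
    pairs = list(pairs)
    correct = sum(1 for p, r in pairs if p > r)
    return correct / len(pairs)
```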

The paper also discusses limitations and future directions. Scaling the preference dataset to hundreds of thousands of queries will require active‑learning or semi‑automated labeling strategies. Generalization to completely unseen domains remains a challenge; meta‑learning or zero‑shot adaptation of the rubric generator could address this. Finally, incorporating ensembles of LLM judges may further stabilize rubric evaluation.

In summary, the work demonstrates that learning rubric generators from human preferences and using them as RL rewards provides a principled, scalable, and human‑aligned pathway for training high‑quality deep‑research report generators. The combination of a hybrid reward, GRPO training, and the MaMs multi‑agent workflow yields both superior rubric supervision and state‑of‑the‑art generation performance, establishing a new paradigm for aligning long‑form LLM outputs with human expectations.

