Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or “rubrics”, demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models’ task accuracy on difficult domains by +45% over models trained with general LLM-as-judge rewards, and to approach the performance of models trained with verifiable rewards while using as few as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door to teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
💡 Research Summary
The paper tackles a fundamental bottleneck in using large language models (LLMs) as judges for reasoning tasks: the inability of LLMs to reliably spot errors in long, domain‑specific, or non‑verifiable reasoning traces. To address this, the authors propose a fully data‑driven pipeline that automatically extracts a fine‑grained taxonomy of reasoning errors—called a “rubric”—directly from model‑generated failures.
Problem Setting
Given a dataset of (question, ground‑truth answer, model reasoning trace, binary correctness label) tuples, the goal is to produce a set of rubric items, each consisting of a concise error description, a high‑recall keyword, and detailed verification cues. The rubric is organized hierarchically: a first pass matches keywords against a compressed version of a new trace, and a second pass applies the full, more specific items only to those traces that triggered the keywords. This two‑stage classification keeps inference cost low even when the rubric contains hundreds of items.
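The two-stage design described above can be sketched in a few lines. This is an illustrative mock-up, not the authors' implementation: the names (`RubricItem`, `match_rubric`) are invented here, and the second pass uses a simple substring check as a stand-in for the paper's LLM-driven verification of the detailed cues.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str        # concise error description
    keyword: str            # high-recall keyword for the first pass
    verification: list      # detailed verification cues for the second pass

def match_rubric(compressed_trace: str, rubric: list) -> list:
    """Two-stage match: cheap keyword screen, then detailed verification."""
    text = compressed_trace.lower()
    # Stage 1: keep only items whose keyword appears in the compressed trace.
    candidates = [item for item in rubric if item.keyword.lower() in text]
    # Stage 2: apply the full, more specific cues to the surviving candidates
    # (a substring check here; the paper applies an LLM judge at this stage).
    return [item for item in candidates
            if any(cue.lower() in text for cue in item.verification)]
```

Because stage 1 is a cheap screen, the expensive stage-2 check runs on only a handful of candidate items per trace, which is what keeps inference cost low even for rubrics with hundreds of entries.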
Methodology
- Trace Compression: The raw reasoning trace is summarized by an LLM to retain only logical steps that influence the final answer, discarding exploratory detours that do not affect correctness.
- Rubric Generation: Compressed incorrect traces are fed to an LLM together with the problem statement and the correct solution. The model is prompted to list the underlying error(s). Each generated item follows a strict template (error description ≤25 words, a keyword for recall, and one or more verification details).
- Keyword Consolidation: To reduce redundancy, a second LLM groups related keywords, typically cutting the keyword set by about 50%.
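For intuition, the consolidation step might look roughly like the following. This is a crude stand-in: the paper uses an LLM to group semantically related keywords, whereas this sketch greedily merges near-duplicate strings with `difflib` similarity; the function name and threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def consolidate_keywords(keywords: list, threshold: float = 0.8) -> list:
    """Greedily merge near-duplicate keywords, keeping one representative
    per group. A string-similarity stand-in for LLM-based grouping."""
    groups = []
    for kw in keywords:
        for group in groups:
            # Compare against the group's representative (its first member).
            if SequenceMatcher(None, kw.lower(), group[0].lower()).ratio() >= threshold:
                group.append(kw)
                break
        else:
            groups.append([kw])
    return [group[0] for group in groups]
```

In practice an LLM can merge keywords that are lexically distant but semantically equivalent (e.g., "off-by-one" and "fencepost"), which simple string similarity cannot.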
Experiments
The authors evaluate three research questions: (R1) Does a rubric improve trace‑classification specificity and overall accuracy? (R2) Can rubric‑augmented classifiers serve as stronger reward functions than standard LLM‑as‑judge rewards? (R3) How do rubric‑based rewards compare to verifiable rewards (e.g., string‑matching) on downstream task performance?
Experiments span technical domains (coding, mathematics, chemical engineering) and a non‑technical qualitative domain, using the Meta NaturalReasoning dataset. For R1, rubric‑enhanced LLM judges achieve up to an 11.6% absolute gain in classification accuracy, primarily driven by higher recall of subtle error patterns. For R2 and R3, the authors train reasoning models via reinforcement learning (RL) where the reward is 1 only if no rubric item applies. Compared to a baseline LLM‑as‑judge reward, the rubric‑based reward yields a 45% relative improvement in downstream task accuracy. Remarkably, when the amount of gold‑label data is reduced to 20% of the full set, rubric‑based RL still approaches the performance of models trained with fully verifiable rewards (e.g., RLVR string matching).
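The reward used for R2 and R3 reduces to a simple rule: return 1 only when no rubric item applies. A minimal sketch, assuming a pluggable `applies(trace, item)` predicate that stands in for the LLM rubric judge (the names and the dict-based rubric format are illustrative, not from the paper):

```python
def rubric_reward(trace: str, rubric: list, applies) -> float:
    """Binary RL reward: 1.0 only when no rubric item applies to the
    trace, i.e. no known error pattern is detected; otherwise 0.0."""
    return 0.0 if any(applies(trace, item) for item in rubric) else 1.0

# Toy stand-in judge: flag an item when its keyword appears in the trace.
def keyword_judge(trace: str, item: dict) -> bool:
    return item["keyword"] in trace.lower()
```

The design choice worth noting is that the reward is all-or-nothing rather than a count of violations: a trace containing any catalogued error pattern earns zero, mirroring the binary correctness signal it replaces.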
Key Insights
- Automatic extraction of error taxonomies from model failures is feasible and yields interpretable, reusable artifacts.
- The two‑stage keyword‑then‑detail classification architecture balances recall and precision while keeping computational overhead modest.
- Rubric‑derived rewards provide a form of “structured supervision” that is more data‑efficient than raw preference or binary correctness signals.
- Even with limited gold labels, rubric‑based RL can approach the ceiling set by fully verifiable reward functions, suggesting a path toward cost‑effective training in expert domains where labeling is expensive.
Limitations and Future Work
The quality of the generated rubric depends on the underlying LLM used for extraction; systematic errors or biases in that model could propagate into the rubric. The compression step may discard nuanced logical dependencies, especially in multi‑step mathematical proofs or complex chemical reaction pathways, potentially leading to missed error categories. Future research directions include (i) ensemble‑based error extraction to improve robustness, (ii) structured summarization techniques that preserve logical proof trees, and (iii) online rubric updating mechanisms that allow the taxonomy to evolve as the target model improves.
Conclusion
By automatically constructing domain‑specific error rubrics and integrating them into LLM‑judge pipelines, the paper demonstrates a practical route to more reliable reasoning verification and more data‑efficient reinforcement‑learning reward modeling. The approach dramatically reduces the need for extensive human‑annotated correctness labels while still delivering near‑state‑of‑the‑art downstream performance in technical reasoning tasks. This work opens a promising avenue for scaling LLM‑based reasoning systems into high‑stakes, expert domains where gold‑standard supervision is scarce or prohibitively costly.