R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work has shifted toward Generative Reward Models (GenRMs), which generate rationales before predicting preferences. Yet GenRM training and evaluation still rely only on the outcome label, leaving reasoning quality unchecked. We show that reasoning fidelity, the consistency between a GenRM's preference decision and the reference decision rationale, is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr): the fraction of label-correct decisions whose rationales are misaligned with gold judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction-following, and general tasks.


💡 Research Summary

The paper “R‑Align: Enhancing Generative Reward Models through Rationale‑Centric Meta‑Judging” investigates a critical blind spot in current Reinforcement Learning from Human Feedback (RLHF) pipelines: generative reward models (GenRMs) are typically trained and evaluated only on the final preference label, while the quality of the intermediate natural‑language rationale they generate is ignored. The authors define “Spurious Correctness” (S‑Corr) as the phenomenon where a GenRM predicts the correct preference but justifies it with a flawed or superficial rationale that does not align with a gold‑standard judgment.

To quantify S‑Corr, the authors repurpose three widely used reward‑model benchmarks (HelpSteer3, RewardBench2, PPE‑Preference). They augment each sample with a gold rationale generated by Gemini‑3‑Pro and introduce a Meta‑Reward Model (MetaRM) that checks logical alignment between a GenRM's generated rationale and the gold rationale, outputting a binary alignment flag. Three metrics are reported: standard label accuracy (L‑Acc); Spurious Correctness (S‑Corr), the fraction of correct‑label cases that fail the alignment test; and Fidelity Score (F‑Score), which requires both a correct label and an aligned rationale.
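Given the two per-sample flags described above (label correctness and the MetaRM's binary alignment flag), the three metrics follow directly. The sketch below is an illustrative reading of those definitions, not the authors' released evaluation code; the `Judgment` type and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    label_correct: bool      # GenRM's preference label matches the gold label
    rationale_aligned: bool  # MetaRM flag: rationale agrees with the gold rationale

def score(judgments: list[Judgment]) -> dict[str, float]:
    n = len(judgments)
    correct = [j for j in judgments if j.label_correct]
    # L-Acc: plain label accuracy over all samples
    l_acc = len(correct) / n
    # S-Corr: among label-correct cases, the fraction whose rationale fails alignment
    s_corr = sum(not j.rationale_aligned for j in correct) / max(len(correct), 1)
    # F-Score: both the label and the rationale must be right
    f_score = sum(j.label_correct and j.rationale_aligned for j in judgments) / n
    return {"L-Acc": l_acc, "S-Corr": s_corr, "F-Score": f_score}

# Toy example: 4 samples, 3 correct labels, 1 of those with a misaligned rationale
js = [Judgment(True, True), Judgment(True, True),
      Judgment(True, False), Judgment(False, False)]
metrics = score(js)  # L-Acc 0.75, S-Corr 1/3, F-Score 0.5
```

Note that a model can post a high L-Acc while S-Corr stays large, which is exactly the gap F-Score is meant to close.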

Empirical analysis shows that even top‑performing open‑source and proprietary GenRMs exhibit substantial S‑Corr (20‑40% on average). Moreover, higher S‑Corr correlates with policy degeneration during RLHF: two GenRMs with nearly identical L‑Acc (Qwen3‑14B and RRM‑32B) lead to dramatically different downstream policies, with the latter collapsing due to exploitation of spurious cues. The authors also observe that larger models and "thinking" (Chain‑of‑Thought) prompting reduce S‑Corr, indicating that stronger reasoning abilities help align the rationale with the true decision criteria.

To address this, the paper proposes R‑Align, a training framework that (i) augments the training set with gold rationales that explicitly state the valid decision basis, and (ii) adds a supervision loss on the generated rationale, penalizing misalignment even when the final label is correct. The MetaRM is used during training to provide automatic alignment feedback, turning the rationale generation into a supervised task.
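One way to read this setup is as a training signal that discounts "spuriously correct" outputs: the outcome label gates the reward, and the MetaRM's binary alignment flag modulates it. The function and penalty value below are hypothetical, a minimal sketch of that idea rather than the paper's actual loss.

```python
def r_align_signal(label_correct: bool, rationale_aligned: bool,
                   misalign_penalty: float = 0.5) -> float:
    """Hypothetical scalar training signal combining the outcome label
    with the MetaRM's binary alignment flag (names are illustrative)."""
    if not label_correct:
        return 0.0                      # wrong preference: no reward at all
    if rationale_aligned:
        return 1.0                      # correct label, faithful rationale
    return 1.0 - misalign_penalty       # spuriously correct: discounted reward
```

Under this reading, a GenRM that reaches the right label through a flawed rationale is still penalized relative to one whose reasoning matches the gold judgment, which is the behavior the rationale‑centric loss is meant to encourage.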

Experiments with 8‑billion and 14‑billion parameter models demonstrate that R‑Align reduces S‑Corr dramatically (often below 30%) and improves F‑Score by 5‑10 percentage points. When these R‑Aligned reward models supervise RLHF, the resulting actor models achieve consistent gains across diverse domains—STEM, coding, instruction‑following, and general chat—improving average scores by 3‑7 points compared to baselines. The gains are observed both for open‑source models (Qwen3‑8B, Qwen3‑14B) and for larger proprietary models (GPT‑5).

The paper also conducts ablation studies confirming that the rationale‑centric loss is the primary driver of improvement, and that the MetaRM’s alignment judgments correlate well with human annotations. Analysis of policy behavior shows that models trained with R‑Align are less prone to reward hacking; they no longer over‑optimize superficial features like bullet‑point formatting, but instead improve the substantive quality of responses.

In summary, the work makes three major contributions: (1) a rationale‑aware benchmarking suite that exposes the prevalence of spurious correctness in current GenRMs; (2) the R‑Align framework that explicitly supervises reasoning traces, substantially lowering spurious correctness; and (3) empirical evidence that rationale‑centric reward modeling yields more robust and higher‑performing RLHF policies. The findings argue convincingly that future reward‑model design must evaluate and train on both “what” (the label) and “why” (the rationale) to achieve reliable alignment of large language models.

