Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM outperforms all compared baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.


💡 Research Summary

The paper tackles the long‑standing limitation of scalar reward models in non‑verifiable domains such as creative writing or open‑ended instruction following, where a single numeric score cannot capture the multi‑faceted nature of response quality. Recent work has introduced rubric‑based reward modeling, which decomposes evaluation into structured criteria, but existing pipelines either rely on costly human‑written rubrics or treat rubric generation and judgment as separate, static components. Consequently, they cannot adapt rubrics to the target domain or jointly improve the underlying preference distribution.

Rubric‑ARM (Rubric‑based Alternating Reinforcement Learning for Reward Modeling) proposes a unified, end‑to‑end framework that treats both the rubric generator (π_r) and the judge (π_j) as stochastic policies learned via reinforcement learning. The rubric is modeled as a latent action sampled from the prompt, and the judge produces a reasoning chain and a binary preference conditioned on the prompt, the two candidate responses, and the generated rubric. The objective is to maximize the expected preference‑correctness R(o, o*) = I[o = o*], the indicator that the judge's predicted preference o agrees with the ground‑truth preference label o*.
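The paper's training loop is not reproduced in this summary. As a toy illustration of the two ideas described above — the indicator reward I[o = o*] and the alternating schedule that freezes one policy while the other updates — here is a minimal bandit-style sketch. All names, the tabular softmax policies, and the two-rubric environment (rubric 0 informative, rubric 1 uninformative) are hypothetical simplifications, not the paper's actual implementation:

```python
import math
import random

class SoftmaxPolicy:
    """Tabular softmax policy trained with a REINFORCE-style update."""

    def __init__(self, n_actions, lr=0.5):
        self.logits = [0.0] * n_actions
        self.lr = lr

    def probs(self):
        m = max(self.logits)
        exps = [math.exp(l - m) for l in self.logits]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(self, rng):
        r, cum = rng.random(), 0.0
        for a, p in enumerate(self.probs()):
            cum += p
            if r <= cum:
                return a
        return len(self.logits) - 1

    def update(self, action, advantage):
        # REINFORCE: logits += lr * advantage * grad log pi(action)
        p = self.probs()
        for a in range(len(self.logits)):
            grad = (1.0 if a == action else 0.0) - p[a]
            self.logits[a] += self.lr * advantage * grad

def train_rubric_arm(steps=600, phase_len=50, seed=0):
    rng = random.Random(seed)
    gold = 1                       # ground-truth preference label o*
    rubric_gen = SoftmaxPolicy(2)  # pi_r: picks rubric 0 (informative) or 1 (uninformative)
    judges = [SoftmaxPolicy(2), SoftmaxPolicy(2)]  # pi_j, conditioned on the sampled rubric
    baseline = 0.0                 # running-mean reward baseline for variance reduction

    for t in range(steps):
        rubric = rubric_gen.sample(rng)
        pred = judges[rubric].sample(rng)
        # Informative rubric: reward is the correctness indicator I[pred == o*].
        # Uninformative rubric: the judgment carries no signal, so reward is 0.
        reward = float(pred == gold) if rubric == 0 else 0.0
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Alternating schedule: only one policy is updated in each phase,
        # so each learner faces a (locally) stationary partner.
        if (t // phase_len) % 2 == 0:
            rubric_gen.update(rubric, advantage)
        else:
            judges[rubric].update(pred, advantage)
    return rubric_gen, judges

gen, judges = train_rubric_arm()
print("P(informative rubric):", round(gen.probs()[0], 2))
print("P(correct judgment | rubric 0):", round(judges[0].probs()[1], 2))
```

Under this toy setup the rubric generator concentrates on the informative rubric and the judge converges on the correct preference; the phase-wise freezing mirrors the paper's argument that alternating updates avoid the non-stationarity of optimizing both policies simultaneously.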

