What Makes a Reward Model a Good Teacher? An Optimization Perspective


The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.


💡 Research Summary

Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. The standard RLHF pipeline consists of two stages: (1) training a reward model (RM) on human preference data, and (2) using that RM as a proxy reward to fine‑tune a language model (the policy) via policy‑gradient methods such as PPO, RLOO, or GRPO. Historically, the quality of an RM has been measured almost exclusively by its accuracy—the proportion of output pairs that the RM ranks in the same order as the (unobservable) ground‑truth reward. Recent empirical work, however, has shown that higher accuracy does not always translate into a better final policy, suggesting that accuracy alone is an insufficient metric.
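The accuracy metric described above can be made concrete with a small sketch. Here `score` is a hypothetical stand-in for a real reward model's scalar scoring function, and the pairs are toy data; this illustrates the pairwise-ranking definition rather than any particular RM implementation.

```python
# Sketch: reward-model accuracy as defined above -- the fraction of
# preference pairs (chosen, rejected) that the RM ranks in the same
# order as the ground-truth label. `score(prompt, response)` is a
# hypothetical scalar scoring function standing in for a real RM.

def rm_accuracy(pairs, score):
    """pairs: iterable of (prompt, chosen, rejected) triples,
    where `chosen` is the ground-truth-preferred response."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if score(prompt, chosen) > score(prompt, rejected):
            correct += 1
    return correct / total if total else 0.0

# Toy example: a (deliberately weak) RM that scores by response length.
toy_pairs = [("q1", "a long detailed answer", "short"),
             ("q2", "ok", "a rambling wrong answer")]
length_rm = lambda prompt, response: len(response)
print(rm_accuracy(toy_pairs, length_rm))  # 0.5
```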

The paper “What Makes a Reward Model a Good Teacher? An Optimization Perspective” introduces a complementary metric: reward variance. For a given prompt \(x\) and a policy \(\pi_\theta\), reward variance is defined as the variance of the RM’s scalar scores over the distribution of outputs sampled from the policy:

\[
\mathrm{Var}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big].
\]

The paper’s central result is that when this variance is low, the RLHF objective is flat around the current policy, so gradient-based optimization proceeds slowly regardless of how accurate the RM is.
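In practice, reward variance for a prompt can be estimated by Monte Carlo: sample outputs from the policy and take the variance of their RM scores. The sketch below assumes hypothetical `sample_outputs` and `score` callables standing in for a real policy sampler and reward model.

```python
# Monte Carlo sketch of reward variance for a prompt x under policy pi:
# sample outputs y ~ pi(.|x), score each with the RM, take the variance.
# `sample_outputs` and `score` are hypothetical placeholders.
import itertools
import statistics

def reward_variance(prompt, sample_outputs, score, n=64):
    rewards = [score(prompt, y) for y in sample_outputs(prompt, n)]
    return statistics.pvariance(rewards)

# Toy example: a "policy" emitting two equally likely outputs and an RM
# scoring them 1.0 and 0.0, giving variance 0.25. A constant RM (every
# output scored identically) would give variance 0 -- the flat-landscape
# case the paper warns about.
two_outputs = lambda prompt, n: list(
    itertools.islice(itertools.cycle(["a", "b"]), n))
binary_rm = lambda prompt, y: 1.0 if y == "a" else 0.0
print(reward_variance("x", two_outputs, binary_rm))  # 0.25
```

Note that a perfectly accurate RM can still have near-zero reward variance (e.g. if its scores saturate), which is precisely why accuracy alone does not predict optimization speed.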

