Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant breakthroughs in complex LLM reasoning within verifiable domains, such as mathematics and programming. Recent efforts have sought to extend this paradigm to open-ended tasks by employing LLMs-as-a-Judge to provide sequence-level rewards for policy optimization. However, these rewards are inherently sparse, failing to provide the fine-grained supervision necessary for generating complex, long-form trajectories. Furthermore, current work treats the Judge as a black-box oracle, discarding the rich intermediate feedback signals encoded in it. To address these limitations, we introduce Grad2Reward, a novel framework that extracts dense process rewards directly from the Judge’s inference process via a single backward pass. By leveraging gradient-based attribution, Grad2Reward enables precise token-level credit assignment, substantially enhancing training efficiency and reasoning quality. Additionally, Grad2Reward introduces a self-judging mechanism, allowing the policy to improve through its own evaluative signals without training specialized reward models or relying on superior external Judges. Experiments demonstrate that policies optimized with Grad2Reward achieve strong performance across diverse open-ended tasks, confirming its effectiveness and broad generalizability.
💡 Research Summary
The paper “Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open‑Ended LLM Reasoning” tackles a fundamental limitation of current reinforcement‑learning‑for‑language‑models (RL‑LM) approaches applied to open‑ended tasks. Existing methods that use an LLM‑as‑a‑Judge provide only a binary, sequence‑level reward after the entire response is generated. This sparse signal offers no fine‑grained guidance for the many intermediate steps that are crucial in tasks such as medical counseling, scientific Q&A, or creative writing, and it discards the rich intermediate information that the Judge implicitly computes while processing the output token‑by‑token.
Grad2Reward proposes to unlock that hidden information by performing a single backward pass through the frozen Judge model. For each generated token \(a_t\) with embedding \(e_t\), the gradient of the log-probability of the Judge’s final decision token \(z\) with respect to \(e_t\) is computed: \(g_t = \nabla_{e_t}\log p_{\text{judge}}(z\mid x,o,c)\). The inner product \(b_t = g_t^\top e_t\) quantifies the first-order contribution of token \(t\) to the final judgment. Normalizing these contributions with a temperature-\(\tau\) softmax yields attribution weights \(\alpha_t\), and the original sequence-level reward \(r(x,o)\) is then distributed token-wise as \(r_t = \alpha_t \cdot r(x,o)\). This transformation turns a single binary signal into dense, token-level rewards without any extra model training or multiple forward passes.
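The attribution step above can be sketched in PyTorch. Here `judge_logits_fn` is a hypothetical hook (not prescribed by the paper) that maps the generated tokens' embeddings to the Judge's logits at the final decision position; the rest follows the formulas in the summary:

```python
import torch
import torch.nn.functional as F

def grad2reward_attribution(judge_logits_fn, token_embeds, z_id, seq_reward, tau=1.0):
    """Distribute a sequence-level reward over tokens via one backward pass.

    judge_logits_fn: maps embeddings (T, d) -> logits (V,) at the Judge's
                     final decision position (illustrative interface).
    token_embeds:    (T, d) embeddings e_t of the generated tokens.
    z_id:            vocab index of the Judge's decision token z.
    seq_reward:      scalar sequence-level reward r(x, o).
    """
    e = token_embeds.detach().requires_grad_(True)
    logits = judge_logits_fn(e)                        # Judge forward pass
    log_p = F.log_softmax(logits, dim=-1)[z_id]        # log p_judge(z | x, o, c)
    (g,) = torch.autograd.grad(log_p, e)               # g_t = grad wrt e_t
    b = (g * e).sum(dim=-1)                            # b_t = g_t^T e_t
    alpha = F.softmax(b / tau, dim=-1)                 # attribution weights
    return alpha * seq_reward                          # r_t = alpha_t * r(x, o)
```

Because the weights are a softmax, the token rewards sum exactly to the original sequence-level reward, so no reward mass is created or lost by the redistribution.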
A second key contribution is the “self‑judging” mechanism. Instead of relying on a larger, external Judge, the authors freeze a copy of the initial policy model and use it as the Judge throughout training. Because LLMs tend to be stronger discriminators than generators, this frozen Judge provides stable, consistent feedback while the policy improves, eliminating the need for costly external models and avoiding knowledge leakage from a superior teacher.
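A minimal sketch of this self-judging setup, assuming the policy is an ordinary PyTorch module (the snapshot's weights stay frozen for the whole run; gradients for attribution flow only to the input embeddings, not the parameters):

```python
import copy
import torch.nn as nn

def make_frozen_judge(policy: nn.Module) -> nn.Module:
    """Snapshot the initial policy and freeze it to serve as the Judge."""
    judge = copy.deepcopy(policy)   # independent copy; policy keeps training
    judge.eval()                    # deterministic eval-mode behavior
    for p in judge.parameters():
        p.requires_grad_(False)     # weights never updated during training
    return judge
```

Freezing the snapshot also keeps the reward scale stationary: the policy is always scored against the same fixed discriminator, rather than a moving target.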
The authors also provide a theoretical justification: a first-order Taylor expansion of the Judge’s log-probability shows that the sum of the \(g_t^\top e_t\) terms approximates the total change in the Judge’s output relative to a zero-embedding baseline. Hence each token’s attribution can be interpreted as a legitimate credit assignment for the overall reward.
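In symbols, expanding around the zero-embedding baseline gives

\[
\log p_{\text{judge}}(z \mid e_{1:T}) \;-\; \log p_{\text{judge}}(z \mid \mathbf{0})
\;\approx\; \sum_{t=1}^{T} \nabla_{e_t}\log p_{\text{judge}}(z)^{\top}\,(e_t - \mathbf{0})
\;=\; \sum_{t=1}^{T} g_t^\top e_t \;=\; \sum_{t=1}^{T} b_t,
\]

so each \(b_t\) is token \(t\)'s first-order share of the total change in the Judge's log-probability, which is what licenses using the normalized \(b_t\) as credit-assignment weights.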
To exploit the token-level rewards, the paper extends the Group Relative Policy Optimization (GRPO) algorithm to the token level. For each token, future rewards are summed to form a return \(R_{i,t}\); these returns are normalized across a group of sampled responses to compute a token-wise advantage \(\hat A_{i,t}\). This “Token-level GRPO” allows the policy to receive distinct learning signals for each token, addressing the coarse granularity of traditional sequence-level RL.
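As a rough sketch of this computation (the paper's exact normalization details may differ), the per-token returns and group-normalized advantages could be computed as:

```python
import numpy as np

def token_level_grpo_advantages(token_rewards, eps=1e-8):
    """Token-wise GRPO-style advantages from dense per-token rewards.

    token_rewards: list of 1-D arrays, one per sampled response i in the
                   group, holding r_{i,t} for each token t.
    Returns a list of advantage arrays A_hat_{i,t}, one per response.
    """
    # Return-to-go: R_{i,t} = sum of rewards from position t to the end.
    returns = [np.cumsum(r[::-1])[::-1] for r in token_rewards]
    # Normalize by the mean/std over the whole sampled group (GRPO-style,
    # with no learned value baseline).
    flat = np.concatenate(returns)
    mu, sigma = flat.mean(), flat.std()
    return [(R - mu) / (sigma + eps) for R in returns]
```

Each token thus gets its own advantage rather than inheriting a single response-level value, which is exactly the finer granularity the summary describes.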
Empirical evaluation spans both verifiable domains (mathematics, programming) and truly open‑ended domains (medical advice, scientific QA, creative writing). Across all benchmarks, Grad2Reward‑trained policies converge in far fewer training steps and achieve higher final scores than baselines that use only sparse sequence rewards or that rely on separately trained Process Reward Models (PRMs). Notably, the method matches or surpasses PRM‑based approaches without the need to collect ground‑truth process annotations, demonstrating superior scalability and generality.
In summary, Grad2Reward introduces a novel, computationally cheap way to extract dense supervision from an LLM‑as‑a‑Judge via gradient‑based attribution, couples it with a self‑judging framework to remove dependence on external judges, and adapts RL optimization to token‑level signals. The result is a powerful, broadly applicable approach that markedly improves the efficiency and quality of open‑ended LLM reasoning.