DAJ: Data-Reweighted LLM Judge for Test-Time Scaling in Code Generation


Test-time scaling for code generation commonly relies on Best-of-N selection, in which multiple candidate solutions are sampled from a base model, and the best one is selected by an LLM judge. However, training reliable LLM judges is challenging due to severe distribution shifts, including imbalances between easy and hard problems, mismatches between training tasks and evaluation benchmarks, and trajectory mismatch arising from training data generated by cheaper models whose behavior differs from that of inference-time models. We propose DAJ, a reasoning-based LLM judge trained with verifiable rewards under a bi-level data-reweighted learning framework. The proposed framework learns data-importance weights (either domain-level or instance-level) to optimize generalization performance on a held-out meta set aligned with target benchmarks. To the best of our knowledge, this is the first application of data reweighting to LLM-as-a-Judge training for test-time scaling. Our approach automatically emphasizes hard problems, in-distribution samples, and trajectory-aligned data, without relying on hand-crafted heuristics. Empirically, DAJ achieves state-of-the-art performance on LiveCodeBench and BigCodeBench, outperforming strong test-time scaling baselines as well as leading proprietary models.


💡 Research Summary

Test‑time scaling (TTS) for code generation often relies on a Best‑of‑N approach: a base model generates multiple candidate solutions, and a separate “judge” model selects the best one. While recent work has shown that large language models (LLMs) can serve as effective judges by producing reasoning traces, training such judges remains difficult because the training data typically differ from the distribution encountered at inference time. The authors identify three major sources of distribution shift: (1) an easy‑hard imbalance where easy problems dominate the training set, (2) a task‑distribution mismatch between training tasks and the target benchmark, and (3) a trajectory mismatch because training candidates are produced by cheaper, weaker models, whereas inference‑time candidates come from stronger models.
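To make the Best‑of‑N setup concrete, here is a minimal sketch of selection with a pairwise judge. The single‑elimination tournament is an illustrative choice of aggregation; the summary only states that the judge compares pairs of candidates, so `judge` and the knockout scheme are assumptions for illustration.

```python
import random

def best_of_n(candidates, judge):
    """Knockout-style Best-of-N selection with a pairwise judge.

    `judge(a, b)` returns whichever of the two candidates it prefers.
    Candidates are paired off each round; the winner advances until
    one remains. An odd candidate out receives a bye.
    """
    pool = list(candidates)
    while len(pool) > 1:
        random.shuffle(pool)            # avoid a fixed bracket order
        nxt = [judge(pool[i], pool[i + 1])
               for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:               # odd one out gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]
```

With a judge that always prefers the objectively better candidate, the tournament returns the best element regardless of bracket order; with a noisy judge, more rounds of comparison mean more chances to misrank.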

To address these challenges, the paper introduces DAJ (Data‑Reweighted LLM Judge), a framework that automatically reweights training examples along three complementary axes: difficulty, domain similarity, and trajectory alignment. Each training sample i receives an importance weight w_i = w_i^diff × w_i^domain × w_i^traj. Difficulty weights up‑sample hard problems, domain weights favor samples whose tasks are similar to the target benchmark, and trajectory weights prioritize examples whose candidate solutions resemble those generated by strong inference‑time models. The weights are not hand‑crafted; they are learned through bi‑level optimization. The lower level optimizes the judge’s parameters ϕ with a weighted loss (either a preference‑optimization loss or a reinforcement‑learning loss with verifiable rewards), while the upper level evaluates the judge on a held‑out meta set that mirrors the target distribution and updates the weights to maximize meta‑set performance. This bi‑level scheme ties the learned weights directly to downstream generalization.
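The bi‑level update can be sketched on a toy model. The following is not the paper’s implementation; it is a one‑step “learning to reweight”‑style approximation on a scalar regressor, where the hypergradient of the meta loss with respect to each sample weight reduces to the alignment between that sample’s training gradient and the meta‑set gradient. The model, learning rates, and loss are all illustrative assumptions.

```python
import numpy as np

def grad(theta, x, y):
    # d/dtheta of the squared error 0.5 * (theta * x - y)**2
    return (theta * x - y) * x

def reweight_step(theta, train, meta, weights, lr=0.1, meta_lr=1.0):
    """One bi-level step: inner weighted update of the model, then an
    outer update that raises the weight of training samples whose
    gradients align with the meta-set gradient (toy sketch only)."""
    xs, ys = train
    g_train = np.array([grad(theta, x, y) for x, y in zip(xs, ys)])
    # Lower level: one weighted gradient step on the judge parameters.
    theta_new = theta - lr * np.dot(weights, g_train)
    # Upper level: gradient of the meta loss at the updated parameters.
    mx, my = meta
    g_meta = np.mean([grad(theta_new, x, y) for x, y in zip(mx, my)])
    # d(meta loss)/d(w_i) = g_meta * d(theta_new)/d(w_i) = -lr * g_meta * g_i
    hyper_grad = -lr * g_train * g_meta
    weights = np.maximum(weights - meta_lr * hyper_grad, 0.0)
    weights = weights / weights.sum()   # renormalise to a distribution
    return theta_new, weights
```

Running one step with a training sample drawn from the meta distribution and one that conflicts with it shifts the weight mass toward the aligned sample, which is the mechanism the framework relies on to emphasize in‑distribution, trajectory‑aligned data.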

The judge itself follows a reasoning‑based LLM‑as‑a‑Judge paradigm. Given a problem statement and a pair of candidate solutions, the model is prompted with “Let’s think step by step…” and generates a multi‑step reasoning trace before emitting a final selection. Because each candidate can be executed against test cases, the correctness of the selection is automatically verifiable. The authors define a simple reward function R: 1 for a correct selection, 0.5 for an incorrect selection, and 0 for a format error. This verifiable reward enables reinforcement learning with verifiable rewards (RL‑VR) without any human‑annotated reasoning data. In the preference‑optimization variant, correct selections are treated as positive examples and all others as negatives, collapsing the three‑level reward into a binary preference signal.
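The three‑level reward described above is a few lines of code. The `<answer>…</answer>` output tag is an illustrative assumption about the judge’s response format (the summary does not specify one); the reward values themselves follow the text.

```python
import re

def judge_reward(judge_output: str, correct_choice: str) -> float:
    """Verifiable reward R for one judged pair, per the scheme in the text:
    1.0 for a correct selection, 0.5 for an incorrect selection,
    0.0 for a format error. The answer-tag format is assumed."""
    m = re.search(r"<answer>\s*([AB])\s*</answer>", judge_output)
    if m is None:
        return 0.0                      # format error
    return 1.0 if m.group(1) == correct_choice else 0.5

def to_preference_label(reward: float) -> int:
    # Preference-optimisation variant: collapse the three levels into a
    # binary signal (correct selections positive, everything else negative).
    return int(reward == 1.0)
```

Note that an incorrect but well‑formed selection (0.5) is rewarded above a format error (0.0), so the RL‑VR objective discourages malformed outputs more strongly than wrong picks.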

Experiments are conducted on two large code‑generation benchmarks: LiveCodeBench and BigCodeBench (including a hard split). DAJ is compared against strong baselines, including existing Best‑of‑N judges (e.g., OpenAI GPT‑4, Claude 2, LLaMA‑2‑70B), recent test‑time scaling methods (KRAFTON AI & SKT AI, Snell et al.), and a baseline GRPO‑based judge without reweighting. DAJ consistently outperforms all baselines, achieving the highest overall scores on the official leaderboards. The gains are especially pronounced on hard problems, where the difficulty reweighting component drives most of the improvement. Ablation studies show that each of the three weighting dimensions contributes positively, and the combination yields the largest boost. The authors also compare domain‑level versus instance‑level weighting, finding that instance‑level weighting provides finer granularity but domain‑level weighting remains effective and simpler to implement.

The paper’s contributions are threefold: (1) introducing a bi‑level data‑reweighting framework for LLM‑as‑a‑Judge training, which systematically addresses distribution shifts without hand‑crafted heuristics; (2) leveraging verifiable rewards to train the judge via either preference optimization or reinforcement learning, eliminating the need for costly human annotations; (3) demonstrating state‑of‑the‑art performance on two major code‑generation benchmarks, surpassing both open‑source and proprietary models.

Limitations include the need for a separate high‑quality meta‑set to drive the upper‑level optimization and the computational overhead of bi‑level training. The current implementation focuses on pairwise comparisons; scaling to larger candidate pools may require more efficient multi‑candidate voting schemes. Future work could explore meta‑learning techniques to accelerate weight learning, extend the framework to multi‑candidate voting, and apply the same reweighting principles to other generation domains such as mathematics or natural‑language generation. Overall, DAJ provides a principled, scalable solution for improving LLM judges in test‑time scaling scenarios, and its methodology is likely applicable beyond code generation.

