CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning
Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first provide a theoretical analysis showing that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, and that optimizing these signals improves the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty–aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state-of-the-art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, while delivering over $20\times$ faster inference than competitive baselines. Code is available at https://github.com/LIGHTCHASER1/CVeDRL.git
💡 Research Summary
The paper tackles the critical bottleneck of code verification in large‑language‑model (LLM) based code generation pipelines. Existing supervised fine‑tuning (SFT) approaches for code verifiers suffer from three major drawbacks: scarcity of high‑quality unit‑test data, high error/failure rates of generated tests, and severe inference inefficiency due to the need to sample many candidate tests. To overcome these limitations, the authors explore reinforcement learning (RL) as a post‑training strategy that can directly optimize execution‑driven rewards without relying on large labeled datasets.
Initial experiments reveal that naïve RL using only a binary functionality reward (pass/fail) fails to produce effective unit tests for hard‑to‑reach branches and for complex code samples. To provide a principled foundation, the authors introduce a “unit‑test majority‑voting” framework and derive a confidence bound that links three quantities: the per‑candidate pass probability (p), the average branch coverage (c), and the prior probability that a candidate is functionally correct (q). The bound shows that maximizing p·c·q directly improves the reliability of the majority‑voting selection, thereby justifying a multi‑dimensional reward design.
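The unit‑test majority‑voting framework described above can be sketched as a toy simulation. This is an illustrative model, not the paper's exact formulation: each generated unit test "votes" for the candidate programs it passes, and the candidate accumulating the most passing tests is selected as the verified output. The helper names (`majority_vote`, `run_test`) are assumptions for the sketch.

```python
# Toy sketch of unit-test majority voting (illustrative, not the paper's
# exact formulation): each generated unit test votes for the candidate
# programs it passes; the candidate with the most passes is selected.
from collections import Counter

def majority_vote(candidates, tests, run_test):
    """Select the candidate passing the most unit tests.

    run_test(candidate, test) -> bool is a hypothetical hook that
    executes `test` against `candidate` and reports pass/fail.
    """
    votes = Counter()
    for cand in candidates:
        for test in tests:
            if run_test(cand, test):
                votes[cand] += 1
    # The candidate with the highest vote count is returned as "verified".
    return max(candidates, key=lambda c: votes[c])

# Minimal usage with stubbed execution results:
results = {("c1", "t1"): True, ("c1", "t2"): True,
           ("c2", "t1"): True, ("c2", "t2"): False}
pick = majority_vote(["c1", "c2"], ["t1", "t2"],
                     lambda c, t: results[(c, t)])
```

Intuitively, the derived bound says this selection becomes more reliable as the pass probability p, coverage c, and correctness prior q grow, which is why the reward design targets all three.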
Guided by this theory, the authors propose CVeDRL, a code verifier trained with a composite reward consisting of (1) a syntax reward that enforces proper AST‑based formatting (e.g., presence of a unittest.TestCase subclass inside a Python code block) and (2) a functionality reward derived from execution outcomes: –2 points for runtime errors, –1.5 points for logical failures, and a positive term equal to the coverage rate for passed tests. These rewards are integrated into Group Reward Policy Optimization (GRPO), which updates the policy using a clipped surrogate objective with KL‑divergence regularization.
Recognizing that a linear coverage reward leads the model to focus on “happy‑path” tests, the authors introduce two difficulty‑aware mechanisms:
- Branch‑difficulty‑aware RL – Coverage reward is reshaped from linear to exponential:
r_cov = (exp(α·cov) – 1) / (exp(α) – 1)
where cov ∈ [0, 1] is the branch coverage rate. Because the shaping is convex, each additional covered branch earns more reward than the last, steering the model toward hard-to-reach branches rather than happy paths.
- Sample‑difficulty‑aware RL – Per the abstract, sample difficulty is estimated from static analysis metrics of the code under test, so that rewards place more weight on harder samples.
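The exponential coverage shaping can be written directly from the formula above; the choice of α here is illustrative (larger α concentrates more of the reward on high coverage).

```python
import math

def shaped_coverage_reward(cov: float, alpha: float = 3.0) -> float:
    """Exponential coverage shaping from the summary:
    r_cov = (exp(alpha * cov) - 1) / (exp(alpha) - 1).

    Maps cov in [0, 1] to [0, 1], but convexly: the marginal reward for
    covering one more branch grows as coverage rises, so the final
    hard-to-reach branches pay disproportionately. alpha = 3.0 is an
    illustrative default, not the paper's setting.
    """
    return (math.exp(alpha * cov) - 1.0) / (math.exp(alpha) - 1.0)
```

For example, with α = 3 a test suite at 50% coverage earns only about 0.18 reward rather than 0.5 under linear shaping, while full coverage still earns 1.0.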