CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
💡 Research Summary
Paper Overview
The authors introduce CodeScaler, an execution‑free reward model (RM) designed to scale both reinforcement‑learning (RL) training and test‑time inference for code generation models. Traditional reinforcement learning from verifiable rewards (RLVR) relies on binary feedback obtained by executing generated programs against curated unit tests. While effective, this approach is limited by the scarcity and cost of high‑quality test cases, especially when scaling to large synthetic datasets. CodeScaler replaces the execution step with a learned dense reward that can be computed in a single forward pass, removing the need for test‑case generation and sandboxed execution.
Key Contributions
- High‑quality Preference Dataset – The authors curate a pairwise preference dataset from on‑policy rollouts of a Qwen3‑8B‑Base model trained with RLVR on the DeepCoder dataset (24K verified problems). Positive examples are solutions that pass all test cases; negatives are any that fail. To improve robustness, they also add “misaligned” negatives (solutions taken from other problems).
- Syntax‑aware Code Extraction – Before scoring, CodeScaler extracts a single code block from the model’s response and validates it with an Abstract Syntax Tree (AST) parser. Invalid or fragmented code is replaced with an empty string, preventing the RM from assigning high scores to syntactically broken programs.
- Validity‑Preserving Reward Shaping – Raw Bradley‑Terry scores are transformed with the softplus function f(z)=ln(1+e^z), which yields strictly positive values for valid code while guaranteeing R′(q,ε)=0 for empty/invalid inputs. This creates a clear penalty for malformed code and mitigates reward hacking.
- RL Training Integration – CodeScaler is used as the reward signal in a Group Relative Policy Optimization (GRPO) framework (implemented with V‑RL3). The dense reward signal enables stable policy updates even without binary execution feedback.
- Test‑time Best‑of‑N (BoN) Scaling – At inference, CodeScaler scores each of N candidate solutions and selects the highest‑scoring one, eliminating the need to run unit tests on every candidate. This yields a ten‑fold latency reduction compared with test‑case‑based BoN (e.g., CURE).
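The syntax‑aware extraction step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes responses wrap code in a markdown fence and uses Python's standard `ast` module to reject anything that fails to parse, returning an empty string exactly as the pipeline described.

```python
import ast
import re

def extract_valid_code(response: str) -> str:
    """Extract the first fenced Python block and keep it only if it parses.

    Returns an empty string for missing or syntactically invalid code,
    mirroring the syntax-aware extraction step described above.
    """
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    code = match.group(1) if match else ""
    try:
        ast.parse(code)  # AST validation: raises SyntaxError on broken code
    except SyntaxError:
        return ""
    return code if code.strip() else ""
```

Feeding the empty string (rather than the raw broken text) to the reward model is what lets the shaping step pin invalid code to a reward of exactly zero.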
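The validity‑preserving shaping can likewise be sketched in a few lines. This is a hedged reconstruction from the formula f(z)=ln(1+e^z): `raw_score` stands in for the scalar Bradley‑Terry output of the reward model, and the overflow guard is an implementation detail added here, not taken from the paper.

```python
import math

def shaped_reward(code: str, raw_score: float) -> float:
    """Map a raw Bradley-Terry score to a strictly positive reward via
    softplus f(z) = ln(1 + e^z); empty/invalid code is pinned to 0."""
    if not code:  # empty string produced by failed extraction/validation
        return 0.0
    if raw_score > 30.0:  # softplus(z) ~= z for large z; avoid exp overflow
        return raw_score
    return math.log1p(math.exp(raw_score))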
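For the RL integration, GRPO consumes the scalar rewards by normalizing them within each group of rollouts sampled for the same prompt. The sketch below shows the standard group‑relative advantage computation; it is the textbook GRPO formula, not code from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each rollout's reward by the mean
    and std of its sampling group, A_i = (r_i - mu) / (sigma + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantage is relative within the group, a dense reward like CodeScaler's still yields a usable learning signal even when no rollout passes a binary pass/fail threshold.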
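The Best‑of‑N selection itself reduces to an argmax over RM scores. A minimal sketch, where `score_fn` stands in for one CodeScaler forward pass per candidate (a hypothetical interface, not the paper's API):

```python
def best_of_n(problem: str, candidates: list[str], score_fn) -> str:
    """Select the candidate with the highest reward-model score.

    One forward pass per candidate replaces sandboxed unit-test
    execution, which is the source of the ~10x latency reduction.
    """
    return max(candidates, key=lambda c: score_fn(problem, c))
```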
Experimental Setup
- Models: Qwen3‑8B‑Base and Qwen3‑14B‑Base.
- Datasets: DeepCoder (verified) and KodCode (synthetic, ~447K problems).
- Benchmarks: LiveCodeBench, CodeContests, LiveBench, MBPP, CodeForces – evaluated with Avg@8 (average pass rate over 8 generated solutions).
- Baselines: (a) RLVR with binary execution reward, (b) Skywork‑Reward‑V2 (RM), (c) AceCodeRM‑7B (RM).
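The Avg@8 metric used above averages the pass indicator over 8 sampled solutions per problem and then over problems. A small sketch of the computation (the matrix layout is an assumption for illustration):

```python
def avg_at_k(pass_matrix: list[list[int]]) -> float:
    """Avg@k: mean pass rate over k samples per problem, averaged over
    problems. pass_matrix[i][j] is 1 if sample j of problem i passes."""
    per_problem = [sum(row) / len(row) for row in pass_matrix]
    return sum(per_problem) / len(per_problem)
```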
Results – Training Time Scaling
When training on DeepCoder, CodeScaler‑augmented RL outperforms binary RLVR by +1.82 points on average across the five benchmarks (e.g., 36.82 vs. 35.00 on CodeForces for the 8B model). On the larger synthetic KodCode set, models trained with CodeScaler close the performance gap that typically exists between synthetic and verified data, demonstrating that the RM can effectively replace execution feedback even when test cases are unavailable.
Results – Test‑time Scaling
Using BoN@8, CodeScaler achieves scores comparable to the unit‑test‑based CURE method (e.g., 75.90 vs. 76.01 on LiveCodeBench) while delivering ≈10× lower latency because no sandbox execution is required. The RM‑based BoN also surpasses off‑the‑shelf RMs by ~5–6 points, confirming the importance of the syntax‑aware extraction and reward shaping.
RM‑Bench Evaluation
Beyond code‑specific tasks, CodeScaler is evaluated on the broader RM‑Bench suite. It improves the code domain by +3.3 points over Skywork‑Reward‑V2‑Qwen‑3‑8B and yields an average +2.7 points uplift across general and reasoning domains, indicating that the training pipeline yields a more universally capable reward model.
Ablation Studies
- Removing syntax‑aware extraction leads to unstable RL training and lower test‑time ranking accuracy.
- Omitting the log‑sigmoid transformation causes the RM to assign positive scores to empty strings, encouraging malformed code generation.
- Excluding misaligned negatives reduces the model’s ability to discriminate between correct and incorrect solutions across different problems.
Limitations & Future Work
CodeScaler currently focuses on Python code and single‑file solutions; extending to multi‑file projects, other programming languages, or more complex software engineering tasks will require richer extraction pipelines and possibly hierarchical reward structures. Moreover, while the preference dataset is high‑quality, it still depends on verified problems; generating equally reliable preferences for entirely novel domains remains an open challenge.
Conclusion
CodeScaler demonstrates that a carefully engineered execution‑free reward model can both scale RL training to large, test‑case‑free datasets and accelerate test‑time inference without sacrificing accuracy. By integrating syntax‑aware extraction, validity‑preserving reward shaping, and high‑quality preference data, the authors achieve consistent improvements over binary execution rewards and existing RMs across multiple benchmarks. This work paves the way for more efficient, scalable, and practical deployment of code generation LLMs in real‑world software development pipelines.