Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model’s recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
💡 Research Summary
Reinforcement finetuning (RFT) has emerged as a powerful technique for aligning large language models (LLMs) with task‑specific goals, especially in domains such as mathematics where correctness can be precisely measured. However, conventional RFT pipelines suffer from severe sample and compute inefficiencies: they repeatedly generate rollouts, compute binary rewards, and update policies, often spending large amounts of computation on problems that are either trivially easy or hopelessly hard for the current model. This leads to slow convergence and limits scalability to competition‑level math datasets.
The paper introduces AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a lightweight extension to any standard RL algorithm (the authors instantiate it with Proximal Policy Optimization, PPO). The core idea is to maintain a target difficulty T that reflects the difficulty level the model should be training on at any moment. Each training step proceeds as follows:

1. Compute the absolute difference Δ_i = |d_i − T| for every problem i in the dataset D, where d_i is a pre‑computed difficulty score.
2. Select the B examples with the smallest Δ_i, forming a batch X that is closest to the current target difficulty.
3. Generate model responses G = π_θ(X) and compute the average binary reward R_avg across the batch.
4. Update the policy π_θ using the chosen RL algorithm.
5. Adjust the target difficulty via

   T′ = clip( T + η·tanh( α·(R_avg − β) ), d_min, d_max ).

Here η is the step size, α controls sensitivity, β is the desired reward level (target success rate), and the tanh‑clip combination ensures smooth, bounded updates.
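The batch-selection and target-update steps above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the function names and the default values for η, α, and β are assumptions (β = 0.5 matches the paper's recommended setting):

```python
import math


def select_batch(difficulties, target, batch_size):
    """Return indices of the batch_size problems whose pre-computed
    difficulty d_i is closest to the current target T (smallest |d_i - T|)."""
    ranked = sorted(range(len(difficulties)),
                    key=lambda i: abs(difficulties[i] - target))
    return ranked[:batch_size]


def update_target(target, r_avg, *, eta=5.0, alpha=2.0, beta=0.5,
                  d_min=0.0, d_max=100.0):
    """Smooth, bounded target update:
    T' = clip(T + eta * tanh(alpha * (R_avg - beta)), d_min, d_max).

    If the model succeeds more often than beta, the target difficulty
    rises; if it succeeds less often, the target falls."""
    new_target = target + eta * math.tanh(alpha * (r_avg - beta))
    return max(d_min, min(d_max, new_target))
```

For example, with difficulties `[10, 50, 90]` and target `T = 55`, `select_batch` picks the problems at difficulty 50 and 90; and when `R_avg` equals β exactly, `update_target` leaves T unchanged, since tanh(0) = 0.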
The authors provide a theoretical justification for setting β ≈ 0.5. In entropy‑regularized RL, the KL‑divergence between the optimal policy and the initial policy is lower‑bounded by the reward variance. For binary rewards, this variance is maximized when the success probability p = 0.5, implying that training on problems the model solves about half the time yields the strongest learning signal. Empirical ablations confirm that β = 0.5 consistently delivers the best performance.
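The variance step in this argument is elementary probability (stated here for clarity, not quoted from the paper): a binary reward with success probability p is a Bernoulli variable, and

```latex
\operatorname{Var}(R) = p(1-p), \qquad
\frac{d}{dp}\,p(1-p) = 1 - 2p = 0
\;\Rightarrow\; p^{*} = \tfrac{1}{2}, \qquad
\operatorname{Var}_{\max} = \tfrac{1}{4}.
```

So the reward signal is most informative, in the variance sense, exactly when the model solves about half of the sampled problems, which is what setting β ≈ 0.5 targets.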
A critical component is the difficulty estimator. The authors use a mid‑size math‑specialized LLM (Qwen 2.5 Math 7B) to generate 128 rollouts per problem and define d_i = 100·(1 – s_i/128), where s_i is the number of successful attempts. They show that even with as few as 64 rollouts the estimated difficulty remains within ±0.05 of the full‑sample estimate for >90 % of problems, dramatically reducing the cost of constructing the curriculum. Correlations with human‑annotated AoPS difficulty levels further validate the estimator’s reliability.
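The difficulty score itself is a direct transform of the empirical success rate over rollouts. A minimal sketch of the stated formula (the helper name is an assumption):

```python
def difficulty_score(successes, n_rollouts=128):
    """Map a rollout success count to a 0-100 difficulty:
    d_i = 100 * (1 - s_i / n).

    0 means the estimator model solved the problem on every rollout;
    100 means it never solved it."""
    return 100.0 * (1.0 - successes / n_rollouts)
```

A problem solved on 64 of 128 rollouts thus lands at difficulty 50, i.e. exactly the β = 0.5 success rate the curriculum steers toward.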
Experiments span a large competition‑level math benchmark (AMC, AIME, IMO) containing 12,500 problems across five difficulty tiers. AdaRFT is evaluated on two model sizes (≈7 B and ≈13 B parameters) and compared against vanilla RFT, RAFT, GRPO, and static staged curricula. Results demonstrate up to a 2× reduction in wall‑clock training time and a 3–5 percentage‑point boost in final accuracy. The gains are especially pronounced in imbalanced data regimes where static sampling fails to present appropriately challenging examples.
The paper also discusses limitations: the approach relies on the quality of pre‑computed difficulty scores; noisy or biased scores could misguide the curriculum. Moreover, the current formulation assumes binary rewards, so extensions to partial credit or multi‑step hint systems are needed for broader applicability. Future work is suggested on online difficulty re‑estimation, multi‑objective reward shaping, and generalization to other reasoning domains such as code generation or scientific problem solving.
In summary, AdaRFT offers a simple yet effective adaptive curriculum mechanism that aligns the difficulty of training examples with the model’s evolving capabilities, thereby accelerating convergence and improving final performance without altering the underlying RL algorithm or reward function. This makes it a practical tool for researchers and engineers seeking scalable reinforcement finetuning of LLMs for complex, structured reasoning tasks.