TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning
Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training and often leads to overthinking, with redundant intermediate steps. To improve learning and reasoning efficiency while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases data complexity based on the model's proficiency across multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; and (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long-thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% over the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.
💡 Research Summary
TACLer introduces a reinforcement-learning framework that simultaneously tackles two major inefficiencies in large language model (LLM) reasoning: the high computational cost of long chain-of-thought (CoT) generation and the phenomenon of "overthinking," where models produce redundant intermediate steps. The approach consists of (i) a model-tailored curriculum learning scheme that dynamically adjusts data difficulty based on the model's current proficiency, and (ii) a hybrid reasoning paradigm that allows the model to operate in either a Thinking mode (full CoT with an explicit reasoning trace) or a NoThinking mode that produces the answer directly, balancing accuracy against inference cost.
Curriculum learning is performed by first running the current model on the entire training set with greedy decoding (8 k context). Each instance is classified into three groups: (1) correct final answer, (2) full reasoning generated but final answer wrong, and (3) reasoning truncated due to length limits. Difficulty is defined by the model’s ability to solve each problem, rather than by static heuristics such as input length. During the first two curriculum stages the training data is composed of a balanced mix of groups 1 and 2, avoiding overwhelming the model with overly hard samples while still providing enough challenge. This process is repeated twice, after which a final stage trains on the full dataset to consolidate knowledge.
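The grouping-and-mixing procedure above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `Rollout` record, function names, and the way rollout results are supplied are all assumptions.

```python
import random
from dataclasses import dataclass

# Hypothetical record of one greedy-decoding rollout (8k-token budget),
# used to bucket problems by the current model's proficiency.
@dataclass
class Rollout:
    problem_id: int
    truncated: bool   # reasoning cut off by the length limit (group 3)
    correct: bool     # final answer matched the reference (group 1)

def classify(rollouts):
    """Split the training set into solved / complete-but-wrong / truncated."""
    groups = {"solved": [], "wrong": [], "truncated": []}
    for r in rollouts:
        if r.truncated:
            groups["truncated"].append(r.problem_id)
        elif r.correct:
            groups["solved"].append(r.problem_id)
        else:
            groups["wrong"].append(r.problem_id)
    return groups

def stage_data(groups, seed=0):
    """Balanced mix of groups 1 and 2 for the first two curriculum stages;
    truncated (too-hard) problems are deferred to the final full-data stage."""
    n = min(len(groups["solved"]), len(groups["wrong"]))
    rng = random.Random(seed)
    mix = rng.sample(groups["solved"], n) + rng.sample(groups["wrong"], n)
    rng.shuffle(mix)
    return mix
```

In this sketch, difficulty is operationalized purely through the model's own rollout outcomes, matching the paper's rejection of static heuristics such as input length.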
The hybrid reasoning mode is trained jointly across all stages. In Thinking mode the model is prompted to emit a detailed reasoning trace inside dedicated delimiters (e.g., <think>…</think> tags) before the final answer; in NoThinking mode the trace is suppressed and the model produces the answer directly. Training both modes jointly lets a single checkpoint serve either paradigm at inference time, leaving the choice to the user.
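A minimal sketch of how the two modes might be realized at the prompt level. The templates and the `<think>` delimiter here are assumptions for illustration, not taken from the paper:

```python
# Illustrative prompt construction for the hybrid Thinking/NoThinking
# paradigm; templates and delimiters are hypothetical.
def build_prompt(question: str, thinking: bool) -> str:
    if thinking:
        # Thinking mode: request a full reasoning trace before the answer.
        return (f"{question}\nPlease reason step by step inside "
                f"<think></think> tags, then give the final answer.")
    # NoThinking mode: pre-fill an empty trace so the model answers directly.
    return f"{question}\n<think></think>\nFinal answer:"

def strip_trace(response: str) -> str:
    """Drop the reasoning trace, keeping only what follows </think>."""
    _, sep, tail = response.partition("</think>")
    return tail.strip() if sep else response.strip()
```

Pre-filling an empty trace is one common way to disable thinking without retraining the tokenizer or changing decoding; whether TACLer uses exactly this mechanism is not specified in the summary.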
Training uses Group Relative Policy Optimization (GRPO). For each question, a group of G responses is sampled from the old policy π_old; the new policy π_θ is updated by maximizing a clipped surrogate objective: each response's probability ratio r_i = π_θ(o_i)/π_old(o_i) is scaled by an advantage term A_i, and the objective takes the minimum of the unclipped term r_i·A_i and its clipped counterpart. Advantages are computed as group-normalized rewards (binary: 1 for a correct final answer, 0 otherwise). The authors remove the KL penalty to allow freer policy updates and raise the upper clipping bound ε_high while fixing ε_low, promoting exploration and preventing premature entropy collapse.
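The objective above can be written out numerically. This is a minimal sketch of the GRPO surrogate with asymmetric clipping and no KL term; the epsilon values are illustrative, not the paper's settings:

```python
import math

# Asymmetric clipping: upper bound widened relative to the lower bound
# (illustrative values, not the paper's hyperparameters).
EPS_LOW, EPS_HIGH = 0.2, 0.28

def grpo_advantages(rewards):
    """Group-normalized advantages from binary rewards (1 = correct)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def grpo_objective(ratios, advantages):
    """Clipped surrogate: mean over the group of min(r*A, clip(r)*A).
    No KL penalty term, per the training setup described above."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - EPS_LOW, min(r, 1.0 + EPS_HIGH))
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

With binary rewards, normalization within each group means correct responses in a mixed group get positive advantages and incorrect ones negative, while uniformly-correct or uniformly-wrong groups yield zero gradient.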
Experiments employ a 1.5 B parameter DeepSeek‑R1‑Distill‑Qwen backbone, trained on the DeepScaleR dataset (≈40 k math problems from AIME competitions). Evaluation is conducted on four challenging math benchmarks: MATH500, AMC, AIME 2024, and AIME 2025. Baselines include long‑CoT models (STILL‑3, DeepScaleR, FastCuRL) and efficiency‑focused models (OverThink, DAST, O1‑Pruner, TLMRE, ModelMerging, AdaptThink, AutoThink).
Results show that TACLer achieves the highest accuracy in Thinking mode on three of four datasets (average 88.4%) and is only marginally behind DeepScaleR on the remaining set. In NoThinking mode it reaches 88.2% accuracy, surpassing the best prior efficient model (AutoThink, 83.8%) by 4.4 percentage points. Token usage drops dramatically compared with baseline long-CoT models: Thinking-mode responses average roughly 3k–8k tokens across benchmarks (≈42% reduction) and NoThinking-mode responses roughly 2k–6k tokens (≈49% reduction). Training compute is cut by more than 50% relative to the standard long-CoT training pipeline, thanks to the curriculum that avoids wasteful processing of truncated samples.
The paper’s contributions are: (1) a proficiency‑driven curriculum that tailors difficulty to the model’s current abilities, dramatically improving learning efficiency; (2) a hybrid Thinking/NoThinking inference framework that balances accuracy and computational cost while preserving user control; (3) an enhanced GRPO‑based RL training loop with specific clipping and KL‑removal tricks; and (4) extensive empirical validation across multiple math reasoning benchmarks demonstrating state‑of‑the‑art performance and efficiency.
Limitations include the reliance on a binary reward signal that does not capture the quality of intermediate reasoning, sensitivity to curriculum hyper‑parameters (number of stages, mixing ratios), and evaluation confined to a 1.5 B model. Future work could explore richer multi‑dimensional rewards, automated curriculum scheduling, scaling to larger models, and application to other domains such as code generation or scientific reasoning.