Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision
Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models that can solve complex problems previously out of reach. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. This makes reasoning length an important scaling knob. In this work, we identify a novel phenomenon (both theoretically and experimentally): under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length (e.g., the token budget in RL, or the loop count in looped Transformers) increases, even after in-distribution (ID) performance has saturated. This suggests that robustness may require a larger budget than ID validation alone would indicate. We provide theoretical explanations via two mechanisms: (i) self-iteration can induce a stronger inductive bias in the hypothesis class, reshaping ID-optimal solutions in ways that improve OOD generalization; and (ii) when shortcut solutions that work for ID samples but not for OOD samples persist in the hypothesis class, regularization can reduce the learned solution’s reliance on these shortcuts as the number of self-iterations increases. We complement the theory with empirical evidence from two realizations of scaling training-time reasoning length: increasing the number of loops in looped Transformers on a synthetic task, and increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning.
💡 Research Summary
This paper identifies and rigorously analyzes a novel phenomenon in training large language models (LLMs): under outcome-only supervision, scaling the reasoning length during training can lead to continued improvements in out-of-distribution (OOD) generalization, even after in-distribution (ID) performance has saturated.
The core premise is that modern methods for enhancing LLM reasoning—such as reinforcement learning (RL) fine-tuning to produce long chain-of-thought (CoT) or using architecturally recurrent models like looped Transformers—effectively introduce a “reasoning length” knob. The authors investigate what happens when this knob is turned up during the training process itself, rather than just at inference time. They frame this as increasing the number of “self-iterations” the model is allowed to perform before producing a final output.
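The notion of "self-iteration" can be made concrete with a minimal sketch (not from the paper; all names here are hypothetical): one shared update step is applied k times to a latent state before a final readout produces the answer. The `step` function stands in for a weight-tied Transformer block (in the looped case) or one reasoning step (in the CoT case); under outcome-only supervision, only the final output is scored.

```python
def self_iterate(step, readout, x, k):
    """Apply one shared step function k times, then read out the answer.

    Turning up the training-time reasoning length corresponds to
    increasing k during training, not just at inference time.
    """
    state = x
    for _ in range(k):
        state = step(state)
    return readout(state)

# Toy usage: repeated doubling, i.e. computing x * 2**k.
double = lambda s: 2 * s
identity = lambda s: s
print(self_iterate(double, identity, 3, 4))  # 3 doubled 4 times -> 48
```

The key point of the abstraction is that the same `step` is reused at every iteration, so choosing k changes the structure of the functions the model can represent, not just its compute budget.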
The central finding is that this increase in training-time self-iteration changes the effective hypothesis class that the learning algorithm searches over. This shift can select for different solutions that are equally optimal on the ID training/validation data but behave differently under distribution shift. The paper provides two primary theoretical mechanisms for why this often benefits OOD performance:
- Inductive Bias Strengthening: Self-iteration imposes structural constraints (e.g., requiring the function to be a k-th compositional root). This can reshape the set of ID-optimal solutions, favoring those that align better with the underlying task structure and thus generalize more robustly, even if the base class is already highly expressive.
- Shortcut Suppression: When “shortcut” solutions that work on ID data but not OOD data exist in the hypothesis class, allowing more iterations can act as a regularizer. It reduces the learned solution’s reliance on these superficial shortcuts, pushing the model toward more fundamental, generalizable algorithms.
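Both mechanisms can be illustrated with a toy example (not from the paper; the task and all names are hypothetical): the target function is "add k", ID inputs are the integers 0 through 9, and OOD inputs are larger. A lookup-table shortcut fits the ID data perfectly but fails off-distribution, whereas a solution forced to be a k-th compositional root of a shared step g(s) = s + 1 extrapolates by construction.

```python
# Two ID-optimal hypotheses for the task f(x) = x + K on inputs 0..9.
K = 5
ID_INPUTS = range(10)

# Shortcut hypothesis: memorize the ID input-output pairs.
shortcut = {x: x + K for x in ID_INPUTS}
shortcut_predict = lambda x: shortcut.get(x, 0)  # arbitrary guess off-distribution

# Self-iterated hypothesis: one shared step g, applied K times (f = g^K).
step = lambda s: s + 1

def iterated_predict(x):
    for _ in range(K):
        x = step(x)
    return x

# Both hypotheses agree (and are optimal) on the ID data...
id_ok = all(shortcut_predict(x) == iterated_predict(x) == x + K for x in ID_INPUTS)
print(id_ok)                    # True

# ...but only the compositional-root solution survives distribution shift.
print(shortcut_predict(1000))   # 0: the shortcut breaks OOD
print(iterated_predict(1000))   # 1005: the iterated solution extrapolates
```

This mirrors the paper's claim in miniature: ID performance alone cannot distinguish the two solutions, while the structural constraint imposed by self-iteration rules the shortcut out of the hypothesis class.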
The theory is substantiated with two concrete empirical demonstrations:
- Scaling Latent Reasoning: On synthetic algorithmic tasks using looped Transformers, ID accuracy saturated at a small number of loops, while OOD accuracy continued to improve gradually as the loop count was increased further.
- Scaling Explicit Reasoning: In RL fine-tuning for mathematical reasoning, increasing the maximum token budget for CoT generation led to OOD accuracy (on unseen problem topics) improving well beyond the point where ID accuracy (on training topics) plateaued.
The practical implication is significant: in scenarios where models are trained with outcome-only supervision (e.g., RL with verifiable outcome rewards, or RLHF) and face potential distribution shift at deployment, it can be highly beneficial to allocate a larger reasoning budget (more CoT tokens or more loops) during training than what seems necessary based on ID validation performance alone. ID performance can be a misleading indicator, masking substantial gains in robustness and generalization that are unlocked by further scaling training-time reasoning length.