LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation
Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
💡 Research Summary
LoopFormer addresses a fundamental limitation of existing looped Transformers: the rigid, pre‑specified number of recurrence steps that prevents flexible compute allocation at inference time. By treating each iteration as a point on a continuous trajectory, the model conditions every loop on two scalar variables – the normalized cumulative time t (0 ≤ t ≤ 1) and the step size Δt (the fraction of the unit interval covered by the current iteration). These scalars are embedded with sinusoidal positional encodings, passed through small MLPs, and used to modulate RMSNorm scaling factors (γ₁, γ₂) and gating coefficients (α₁, α₂) applied to the residual connections of the multi‑head self‑attention (MHSA) and feed‑forward (FFN) sub‑layers. This “time‑step conditioning” makes the shared Transformer block explicitly aware of where it lies on the trajectory and how coarse or fine the current update is, enabling the same parameters to behave consistently across both fine‑grained (many small steps) and coarse‑grained (few large steps) schedules.
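The conditioning path described above can be sketched as follows. This is a minimal stand-alone illustration, not the authors' implementation: the embedding width, frequency range, MLP sizes, and the "scales centered at 1, gates at 0" initialization are all assumptions.

```python
import torch
import torch.nn as nn

def sinusoidal_embedding(x: float, dim: int = 64) -> torch.Tensor:
    """Sin/cos features for a scalar in [0, 1] (width and frequency range assumed)."""
    x = torch.tensor([float(x)])
    freqs = torch.exp(torch.linspace(0.0, 6.0, dim // 2))  # log-spaced frequencies
    angles = x * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class TimeStepModulation(nn.Module):
    """Map (t, dt) to RMSNorm scales (gamma1, gamma2) and residual gates
    (alpha1, alpha2) for the MHSA and FFN sub-layers."""
    def __init__(self, d_model: int, emb_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, d_model),
            nn.SiLU(),
            nn.Linear(d_model, 4 * d_model),
        )

    def forward(self, t: float, dt: float):
        cond = torch.cat([sinusoidal_embedding(t), sinusoidal_embedding(dt)], dim=-1)
        g1, g2, a1, a2 = self.mlp(cond).chunk(4, dim=-1)
        # Center scales at 1 and gates at 0 so the default behaves like an
        # unmodulated block (an initialization choice we assume, not stated).
        return 1.0 + g1, 1.0 + g2, a1, a2
```

The key design point is that both scalars are embedded, so the shared block can distinguish "late in a fine-grained trajectory" from "early in a coarse one" even when t alone is similar.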
Training proceeds on a family of trajectories rather than a single fixed schedule. For a maximum depth L, the algorithm samples a shortcut length S (1 ≤ S < L) and a corresponding step schedule Δ_S that still satisfies Σ_i Δ_i = 1. The loss combines three terms: (i) the standard next‑token cross‑entropy on the full L‑step trajectory (L_L), (ii) the same loss on the sampled shortcut trajectory (L_S), and (iii) a shortcut‑consistency loss (L_cons) that aligns the token logits of the shortcut trajectory with stop‑gradient logits from the full trajectory. This consistency term forces shorter trajectories to predict the same final representation that the full trajectory would reach, thereby preventing representation collapse when fewer loops are used. Hyper‑parameters λ₁ and λ₂ weight the shortcut and consistency losses; the authors report λ₁ = λ₂ ≈ 0.1 as a stable setting.
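The three-term objective can be written compactly as L = L_L + λ₁·L_S + λ₂·L_cons. A hedged sketch, assuming KL divergence as the logit-alignment measure (the summary specifies stop-gradient alignment but not the exact divergence):

```python
import torch
import torch.nn.functional as F

def loopformer_loss(logits_full, logits_short, targets, lam1=0.1, lam2=0.1):
    """L_L + lam1 * L_S + lam2 * L_cons over (batch, seq, vocab) logits.
    The consistency term aligns shortcut logits with stop-gradient logits
    from the full L-step trajectory; KL is an assumed choice of divergence."""
    vocab = logits_full.size(-1)
    ce_full = F.cross_entropy(logits_full.reshape(-1, vocab), targets.reshape(-1))
    ce_short = F.cross_entropy(logits_short.reshape(-1, vocab), targets.reshape(-1))
    cons = F.kl_div(
        F.log_softmax(logits_short, dim=-1),
        F.softmax(logits_full.detach(), dim=-1),  # stop-gradient teacher
        reduction="batchmean",
    )
    return ce_full + lam1 * ce_short + lam2 * cons
```

Because the teacher logits are detached, gradients from L_cons flow only through the shortcut trajectory, pulling short schedules toward the full trajectory's prediction rather than dragging the full trajectory down.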
Architecturally, LoopFormer retains the simplicity of a single shared decoder‑only Transformer block (k = 1 or 2 layers) and augments it with the time‑step modulation described above. The design draws inspiration from diffusion models (e.g., DiT’s adaLN) and recent shortcut‑distillation work, but uniquely incorporates the step‑size Δt and trains across multiple trajectories, which is essential for elastic‑depth inference.
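Structurally, elastic-depth inference amounts to applying one weight-shared block a variable number of times while advancing (t, Δt). The sketch below uses a linear layer as a stand-in for the real MHSA+FFN block and a uniform step schedule (Δt = 1/S); both simplifications are ours, not the paper's.

```python
import torch
import torch.nn as nn

class ElasticLoop(nn.Module):
    """One shared block applied num_loops times, each pass conditioned on
    (t, dt). The core is a linear stand-in for the shared MHSA+FFN layer,
    so this shows the control flow only, not the real architecture."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.core = nn.Linear(d_model, d_model)  # stand-in for the shared block
        self.cond = nn.Linear(2, 2 * d_model)    # (t, dt) -> (gamma, alpha)

    def forward(self, h: torch.Tensor, num_loops: int) -> torch.Tensor:
        dt = 1.0 / num_loops  # uniform schedule: the dt values sum to 1
        t = 0.0
        for _ in range(num_loops):
            gamma, alpha = self.cond(torch.tensor([t, dt])).chunk(2, dim=-1)
            h = h + torch.tanh(alpha) * self.core((1.0 + gamma) * h)  # gated residual
            t += dt
        return h
```

The same parameters serve every budget: calling the module with `num_loops=4` or `num_loops=16` traverses the same unit interval in coarser or finer steps.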
Empirically, a 1.3 B‑parameter LoopFormer was evaluated on language modeling datasets (WikiText‑103, C4) and zero‑shot reasoning benchmarks (GSM‑8K, ARC‑Easy, BoolQ). Key findings include:
- Compute‑efficiency – At 30 % of the maximum loop budget, perplexity is within 5 % of the full‑budget model, outperforming non‑looped baselines with comparable FLOPs.
- Smooth scaling – Performance improves almost linearly with the number of loops; there is no abrupt degradation when the budget is reduced, unlike fixed‑L looped models that exhibit severe collapse at short depths.
- Reasoning robustness – On reasoning tasks, even low‑budget runs achieve competitive zero‑shot scores, and full‑budget runs close the gap to state‑of‑the‑art looped Transformers.
To understand representation dynamics, the authors measured CKA similarity, curvature, entropy, and anisotropy across loop steps. Standard looped models showed stagnation (high similarity, low curvature) as depth increased, whereas LoopFormer maintained evolving representations, higher curvature, and stable entropy, confirming that the shortcut‑consistency loss preserves a non‑degenerate trajectory.
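CKA similarity, the first of those diagnostics, can be computed with the standard linear variant; a minimal sketch (the paper may use a kernelized or debiased estimator):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices (n_samples x dim),
    one way to quantify how much hidden states change between loop steps.
    Values near 1 indicate stagnation; lower values indicate evolution."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(hsic / (norm_x * norm_y))
```

Applied to hidden states at consecutive loop steps, near-unit CKA is the "stagnation" signature the authors observe in standard looped models, while LoopFormer's lower step-to-step similarity reflects a still-evolving trajectory.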
In summary, LoopFormer introduces three synergistic components—time‑step conditioning, trajectory‑wide shortcut consistency, and budget‑conditioned inference—that together enable looped Transformers to adapt their computational depth on the fly without retraining. This opens a practical path toward cost‑aware large language models and suggests that similar trajectory‑based conditioning could benefit other domains such as vision or multimodal modeling.