Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
The chain of thought, i.e., step-by-step reasoning, is one of the fundamental mechanisms of Transformers. While the design of intermediate reasoning steps has been extensively studied and shown to critically influence performance on mathematical, multi-step reasoning tasks, the ordering of these steps has received little attention, despite its significant effect on the difficulty of reasoning. This study addresses a novel task of unraveling the chain of thought: reordering decoder input tokens into a learning-friendly sequence for Transformers learning arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage of training. As the search space grows factorially in sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on seven order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, it recovers the reverse-digit order reported in prior studies for the multiplication task.
💡 Research Summary
The paper investigates a previously overlooked aspect of chain‑of‑thought (CoT) reasoning in Transformer decoders: the order in which the target tokens are generated. While prior work has focused on which intermediate steps to include, the arrangement of those steps can dramatically affect learning, especially for arithmetic tasks where carries and other irreversible operations propagate in a specific direction. The authors formalize the problem as finding a permutation π of the target sequence Y that minimizes the expected loss after training a model on data reordered by π.
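One way to write this objective (the notation here is our paraphrase, not necessarily the paper's exact formulation): over the symmetric group of permutations of the length-L target sequence, pick the permutation that minimizes the expected loss of a model trained on targets reordered by that permutation,

```latex
\pi^{*} \;=\; \arg\min_{\pi \in S_L} \;
\mathbb{E}_{(X, Y)}\!\left[\,
\mathcal{L}\big(f_{\theta(\pi)}(X),\; \pi(Y)\big)
\,\right],
```

where $\pi(Y)$ denotes the target tokens rearranged by $\pi$ and $\theta(\pi)$ are the parameters obtained by training on the $\pi$-reordered data. The difficulty is that $|S_L| = L!$, which motivates the early-dynamics proxy and hierarchical search described next.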
To identify “learning‑friendly” permutations, they exploit the well‑known easy‑to‑hard learning dynamics of deep networks: during the early epochs, models quickly reduce loss on easy examples and only later fit harder ones. The proposed pipeline mixes many candidate permutations in a single training run, trains a lightweight Transformer for a small number of epochs (a few thousand batches), and then evaluates the validation loss for each permutation separately. The permutation that yields the fastest loss drop is selected as the optimal order. This approach can handle thousands of permutations simultaneously without prohibitive computational cost.
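The selection step can be sketched as follows. This is a minimal illustration under our own assumptions (function names, the toy loss curves, and the probe epoch are all hypothetical, not from the paper): after a short mixed training run, each candidate permutation has its own validation-loss trajectory, and the permutation with the lowest loss at an early probe point is selected.

```python
def select_order(loss_curves, probe_epoch=3):
    """Pick the permutation whose validation loss has dropped
    fastest by `probe_epoch` (early-stage learning dynamics)."""
    return min(loss_curves, key=lambda perm: loss_curves[perm][probe_epoch])

def reorder(tokens, perm):
    """Apply a permutation (tuple of target positions) to a token sequence."""
    return [tokens[i] for i in perm]

# Toy loss trajectories for two candidate orders of a 4-token target:
# the reverse order reduces loss much faster in early epochs.
curves = {
    (0, 1, 2, 3): [2.3, 2.1, 1.9, 1.7],  # forward order: slow drop
    (3, 2, 1, 0): [2.3, 1.5, 0.8, 0.3],  # reverse order: fast drop
}
best = select_order(curves)
print(best)                                # (3, 2, 1, 0)
print(reorder(["1", "2", "3", "4"], best)) # ['4', '3', '2', '1']
```

Because all candidates share one training run, the cost grows with the number of training steps, not with the number of permutations evaluated.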
Because the permutation space grows factorially with sequence length, the authors introduce a two‑stage hierarchical search. In the first (global) stage, the output sequence is divided into fixed‑size blocks (e.g., 4–5 tokens). The order of these blocks is permuted, reducing the search space from L! to B! where B is the number of blocks. In the second (local) stage, each block’s internal token order is refined using the same mix‑train‑evaluate procedure while keeping the block order fixed. This hierarchical scheme enables the discovery of optimal orders for sequences up to length 13 (≈10⁹ permutations) with random initialization, and up to length 40 (≈10⁴⁷ permutations) when the initial candidates are structured.
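The size of the reduction is easy to quantify. A rough sketch (the exact candidate-count accounting is our simplification, since the two stages can be combined in more than one way): instead of searching all L! orders at once, the hierarchical scheme searches B! block orders, then refines each block's internal order independently.

```python
from math import factorial

def flat_space(L):
    """Number of candidate orders under exhaustive token-level search."""
    return factorial(L)

def hierarchical_space(L, block_size):
    """Candidates examined by the two-stage search: all block orders,
    then all intra-block orders for each block (searched independently)."""
    n_blocks = L // block_size
    return factorial(n_blocks) + n_blocks * factorial(block_size)

L, block_size = 12, 4
print(flat_space(L))                       # 479001600
print(hierarchical_space(L, block_size))   # 78
```

Even at L = 12 with blocks of 4 tokens, the candidate set shrinks from roughly 5 x 10^8 to under a hundred, which is what makes lengths of 13 and beyond tractable.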
The authors evaluate seven order‑sensitive arithmetic tasks: addition, subtraction, multiplication, division (quotient + remainder), cumulative ReLU, and two tasks involving irreversible functions (e.g., max, floor). For each task they construct multiple candidate orders, including the natural forward order (most‑significant‑digit‑first), the reverse order (least‑significant‑digit‑first), and random permutations. They also test a naïve soft‑permutation approach that treats the permutation matrix as a continuous variable; this method fails because it creates “information leakage” from future tokens, causing an artificial early loss drop but poor generalization.
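The candidate orders named above can be constructed mechanically. A small sketch with hypothetical helper names (position 0 denotes the most significant digit):

```python
import random

def forward_order(L):
    """Natural order: most-significant-digit first."""
    return list(range(L))

def reverse_order(L):
    """Reversed order: least-significant-digit first."""
    return list(range(L - 1, -1, -1))

def random_orders(L, k, seed=0):
    """k random permutations of the target positions."""
    rng = random.Random(seed)
    orders = []
    for _ in range(k):
        perm = list(range(L))
        rng.shuffle(perm)
        orders.append(perm)
    return orders

L = 5
print(forward_order(L))  # [0, 1, 2, 3, 4]
print(reverse_order(L))  # [4, 3, 2, 1, 0]
```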
Key findings include:
- Multiplication – the reverse‑digit order is automatically rediscovered as the most learning‑friendly permutation, confirming prior heuristic results (Shen et al., 2023). Using this order raises success rates from ~10 % to >95 % on a 100‑sample test set.
- Cumulative ReLU – the forward order is superior; the reverse order forces the model to resolve the irreversible ReLU operation without the benefit of prior context, leading to slower loss reduction and lower final accuracy.
- Addition and Subtraction – relatively insensitive to order, though forward order still shows a slight edge.
- Across all tasks, employing the discovered optimal order improves final accuracy dramatically (often from 10 % to 100 %) and reduces the number of training epochs needed to reach a given performance threshold.
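The multiplication finding has a simple intuition, illustrated here with addition (the example is our own, not taken from the paper): when digits are generated least-significant first, every output digit depends only on inputs and on carries already computed, so a left-to-right decoder never has to anticipate a carry it has not yet seen.

```python
def add_digits_lsd_first(a_digits, b_digits):
    """Add two equal-length numbers given as least-significant-first
    digit lists. Each emitted digit depends only on the current input
    digits and the carry from digits already emitted -- the direction
    a left-to-right decoder naturally generates in."""
    out, carry = [], 0
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

# 478 + 356 = 834, with digits stored least-significant first
print(add_digits_lsd_first([8, 7, 4], [6, 5, 3]))  # [4, 3, 8]
```

Generating the most significant digit first would instead require resolving the entire carry chain before emitting the first token, which is exactly the dependency structure the reverse-digit order avoids.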
The paper also demonstrates that the hierarchical search scales efficiently: for L = 40, the structured‑initial‑candidate version finds the optimal permutation within a few hours on eight GPUs, a speed‑up of 5–6 orders of magnitude compared with exhaustive search.
In the discussion, the authors note that while empirical results suggest learning‑friendly orders coincide with those that minimize the number of required arithmetic operations (e.g., aligning with carry propagation), a formal theoretical justification remains an open problem. They also acknowledge that their method is static—permutations are fixed before training—and propose future work on dynamic curricula that could adapt the order during training. Limitations include sensitivity to block size for very long sequences and the need to define appropriate tokenizations for non‑numeric or multimodal tasks.
Overall, the study introduces a novel, systematic approach to output‑order optimization for Transformer‑based arithmetic reasoning. By leveraging early‑stage loss dynamics and a two‑stage hierarchical search, it automatically discovers permutations that dramatically improve learning efficiency and final performance. The findings highlight output order as a critical design variable for future work on symbolic reasoning, program synthesis, and other domains where the sequence of intermediate computations matters.