Compositional Reasoning with Transformers, RNNs, and Chain of Thought

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

It is well understood that different neural network architectures are suited to different tasks, but is there always a single best architecture for a given task? We compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of tasks we term Compositional Reasoning Questions (CRQ). This family captures multi-step problems with tree-like compositional structure, such as evaluating Boolean formulas. We prove that under standard hardness assumptions, none of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We then provide constructions for solving CRQs with each architecture. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. For transformers with chain of thought, our construction uses n CoT tokens for input size n. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.


💡 Research Summary

The paper introduces a formal problem class called Compositional Reasoning Questions (CRQs) to study the reasoning capabilities of large language models. A CRQ is defined as a rooted tree where each node carries a fixed‑dimensional vector label. Leaves output their own label, while each internal node computes the answer by taking the arg‑max over the inner products between its label and the answers of its children. This formulation captures tree‑structured tasks such as Boolean formula evaluation, which is NC¹‑complete, and more generally any hierarchical composition of sub‑problems.
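The node semantics described above can be written down directly. Below is a minimal Python sketch of CRQ evaluation (the `Node` class and helper names are illustrative choices, not from the paper): a leaf answers with its own label, and an internal node answers with whichever child answer has the largest inner product with the node's label.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: tuple                 # fixed-dimensional vector label
    children: list = field(default_factory=list)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def evaluate(node):
    """Answer of a CRQ node: a leaf returns its own label; an internal
    node returns the child answer maximizing the inner product with the
    node's own label."""
    if not node.children:
        return node.label
    answers = [evaluate(c) for c in node.children]
    return max(answers, key=lambda a: dot(node.label, a))
```

For instance, a root labeled (1, 0) with leaf children labeled (0, 1) and (1, 0) answers (1, 0), since that child answer maximizes the inner product with the root's label.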

The authors compare three model families: (1) transformers of varying depth, (2) recurrent neural networks (RNNs) with bounded hidden dimension, and (3) shallow (constant‑depth) transformers that are allowed to generate a sequence of “Chain‑of‑Thought” (CoT) tokens. For each family they ask: what resource must scale with the input size n in order to solve all CRQs of size n, and how does this scaling affect parallelism, runtime, and parameter count?

Transformers.
Theorem 4.1 shows that a transformer of depth L can solve any CRQ whose tree depth is at most L. The construction uses one layer per tree level, so all nodes at the same depth are resolved in parallel. The required embedding dimension is O(d + log n), essentially to encode positional information. Conversely, Theorem 4.3 proves that, under standard hardness assumptions (constant-depth transformers can be simulated by TC⁰ circuits, a class conjectured to be strictly weaker than the NC¹ class containing Boolean formula evaluation), a transformer whose depth does not grow with n cannot solve arbitrary CRQs. Thus growing depth is both necessary and sufficient for transformers on this task. The trade-off: depth must scale as log n in the worst case, but in exchange the parallel runtime is only O(L) and the parameter count grows modestly, as O(L log² n).
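The one-layer-per-level scheduling idea can be mimicked in plain Python. This is a sketch of the evaluation schedule only, not an actual transformer; the array-based tree encoding and the convention that node 0 is the root are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def evaluate_levelwise(labels, children, depth_of):
    """Resolve all nodes at the same tree depth in one pass, mirroring the
    one-transformer-layer-per-level construction.  labels[i] is node i's
    vector, children[i] its child indices, depth_of[i] its depth
    (root = node 0 at depth 0)."""
    answers = {}
    for level in range(max(depth_of), -1, -1):    # deepest level first
        for i, d in enumerate(depth_of):
            if d != level:
                continue
            if not children[i]:
                answers[i] = labels[i]            # leaves copy their label
            else:
                answers[i] = max((answers[c] for c in children[i]),
                                 key=lambda a: dot(labels[i], a))
    return answers[0]                             # answer at the root
```

Each pass of the outer loop touches only one level, so the passes correspond to layers and all work inside a pass is independent, which is where the O(L) parallel runtime comes from.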

RNNs.
RNNs are analyzed under two input orderings. If the input sequence respects the tree's hierarchy (e.g., a breadth-first or depth-first traversal in which every parent is presented before its children), Theorem 5.4 gives an algorithm that computes the CRQ answer with a hidden state of size O(log n) and constant depth. The hidden state stores a compact summary of the partial results, and each new token updates this summary using the parent-child relationship. Theorem 5.2, by contrast, shows that if the input order is adversarial (children may appear before their parents), then any RNN solving all CRQs must have hidden dimension Ω(n). Hence, RNNs trade memory for ordering assumptions: they are extremely parameter-efficient when the data arrives well-ordered but become memory-heavy otherwise. Their runtime is inherently sequential, O(n), with little parallelism.
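One way to picture why tree depth (O(log n) for balanced trees) bounds the memory needed in the ordered regime is a stack-based scan. The token format below, pre-order (label, num_children) pairs, is an illustrative assumption rather than the paper's encoding; the stack plays the role of the RNN's hidden state.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def evaluate_stream(tokens):
    """Scan a pre-order token stream of (label, num_children) pairs,
    keeping a stack of unfinished ancestors.  The stack depth never
    exceeds the tree depth, so a balanced tree needs O(log n) memory."""
    stack = []  # entries: [label, children_left, best_child_answer]
    for label, num_children in tokens:
        stack.append([label, num_children, None])
        # close every node whose children have all been resolved
        while stack and stack[-1][1] == 0:
            lab, _, best = stack.pop()
            answer = lab if best is None else best   # leaf -> its own label
            if not stack:
                return answer                        # root resolved
            parent = stack[-1]
            parent[1] -= 1
            if parent[2] is None or dot(parent[0], answer) > dot(parent[0], parent[2]):
                parent[2] = answer                   # running arg-max over children
    raise ValueError("malformed token stream")
```

If children could arrive before their parents, the scanner would have to buffer arbitrarily many unresolved answers, which is the intuition behind the Ω(n) lower bound for adversarial orderings.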

Chain‑of‑Thought Transformers.
CoT prompting allows a transformer to emit intermediate tokens that act as an external memory. Theorem 6.1 proves that a constant‑depth transformer equipped with O(n) CoT tokens can solve any CRQ, essentially by writing the answer of each node as a separate token and then using later tokens to combine them. In contrast, using only O(log n) CoT tokens is insufficient. Thus, the “depth” resource can be traded for a linear number of generated tokens. This yields a model with a tiny parameter count (shallow network) but a runtime proportional to the number of CoT tokens, i.e., O(n), and very low parallelism because tokens are generated autoregressively.
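The write-one-answer-per-token idea can be simulated outside of any transformer. The sketch below treats the chain of thought as a tape: each decoding step reads only the tape, emits the answer of one newly resolvable node, and the loop runs for O(n) steps. The array encoding and root-is-node-0 convention are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_with_cot(labels, children):
    """Simulate the chain-of-thought construction: one token (a resolved
    node's answer) is written to the tape per step, so a tree with n
    nodes takes O(n) generated tokens.  Each step only consults the
    tape, mimicking a shallow transformer attending over its own CoT."""
    tape = {}                                  # node index -> emitted answer
    n = len(labels)
    while len(tape) < n:                       # one CoT token per iteration
        for i in range(n):
            if i in tape:
                continue
            if all(c in tape for c in children[i]):
                if not children[i]:
                    tape[i] = labels[i]        # leaf: emit its own label
                else:
                    tape[i] = max((tape[c] for c in children[i]),
                                  key=lambda a: dot(labels[i], a))
                break                          # emit exactly one token per step
    return tape[0]                             # root's answer ends the chain
```

The strictly sequential outer loop makes the low parallelism visible: each emitted token can depend on every token before it, so the n steps cannot be batched.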

Trade‑off Table.
Table 1 summarizes the three architectures: deep transformers minimize parallel runtime but require logarithmic depth; RNNs minimize parameter count but need ordered inputs and run sequentially; CoT transformers minimize depth but pay a linear token‑generation cost. The table also lists parameter scaling: O(L log² n) for deep transformers, O(log n) hidden dimension for ordered RNNs, and O(log n) parameters for CoT transformers (the bulk of the cost lies in the generated tokens).

Implications.
The work challenges the notion that a single architecture dominates across all reasoning tasks. Instead, it shows that for a natural class of hierarchical problems, each architecture can be optimal under different resource constraints. The analysis highlights the importance of encoding structural information (e.g., positional encodings that reflect tree topology) and suggests that practical LLM systems could benefit from task‑specific prompting or architectural tweaks that align with the underlying problem structure.

Conclusion.
By formalizing CRQs and rigorously proving upper and lower bounds for transformers, RNNs, and CoT‑augmented transformers, the paper maps a rich complexity landscape for compositional reasoning. Depth, hidden dimension, and the number of CoT tokens emerge as interchangeable resources that trade off parallelism, parameter efficiency, and runtime. This theoretical framework provides a foundation for future work on architecture‑aware prompting, hardware‑aware model design, and the development of more efficient reasoning systems that respect the structural nature of the tasks they solve.

