How Does Unfaithful Reasoning Emerge from Autoregressive Training? A Study of Synthetic Experiments
Chain-of-thought (CoT) reasoning generated by large language models (LLMs) is often unfaithful: intermediate steps can be logically inconsistent or fail to reflect the causal relationship leading to the final answer. Despite extensive empirical observations, a fundamental understanding of CoT is lacking: what constitutes faithful CoT reasoning, and how does unfaithfulness emerge from autoregressive training? We study these questions using well-controlled synthetic experiments, training small transformers on noisy data to solve modular arithmetic expressions step by step, a task we term Arithmetic Expression Reasoning. We find that models can learn faithful reasoning that causally follows the underlying arithmetic rules, but only when the training noise is below a critical threshold, a phenomenon attributable to simplicity bias. At higher noise levels, training dynamics exhibit a transition from faithful stepwise reasoning to unfaithful skip-step reasoning via an intermediate mixed mode characterized by a transient increase in prediction entropy. Mechanistic analysis reveals that models learn to encode internal uncertainty by resolving inconsistent reasoning steps, which suggests the emergence of implicit self-verification from autoregressive training.
💡 Research Summary
This paper investigates why and how large language models (LLMs) sometimes generate chain‑of‑thought (CoT) reasoning that is not faithful to the underlying logical rules, focusing on the emergence of unfaithful reasoning from standard autoregressive (next‑token) training. To obtain a clean experimental setting, the authors design a synthetic task called Arithmetic Expression Reasoning (AER). In AER, each training example consists of a short modular‑arithmetic expression chain of the form “a × b − c → d − c → o”, where all operations are performed modulo a prime N = 97. The chain explicitly represents a two‑step computation: first compute an intermediate result (f₁) and then compute the final answer (f₂).
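The two-step chain can be sketched in a few lines of Python. This is a minimal illustration of the AER format described above, not the paper's actual data-generation code; the function name is hypothetical.

```python
N = 97  # prime modulus used in the paper

def make_aer_chain(a, b, c, n=N):
    """Build the ground-truth two-step chain for (a*b - c) mod n."""
    d = (a * b) % n   # f1: intermediate result
    o = (d - c) % n   # f2: final answer
    return (f"{a}*{b}-{c}", f"{d}-{c}", f"{o}")

print(make_aer_chain(5, 7, 3))  # -> ('5*7-3', '35-3', '32')
```

Each tuple element corresponds to one segment of the "a × b − c → d − c → o" chain, so a training example is simply the three segments joined by arrows.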
During data generation the authors inject two independent noise parameters: ε₁ corrupts the first operand(s) of the prompt expression, and ε₂ corrupts the intermediate expression. This mimics the noisy, ambiguous nature of real‑world corpora. They train small transformers (3 layers, 2 heads, 128‑dimensional embeddings) from scratch on 2 million examples, varying ε₁ and ε₂ across a grid. The same qualitative behavior persists when the model size, training set size, or modulus N is changed.
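The two noise channels could be sketched as independent per-example corruptions. This is an illustrative interpretation of ε₁ and ε₂; the paper's exact corruption scheme may differ in which tokens it replaces.

```python
import random

def corrupt_chain(chain, eps1, eps2, n=97, rng=random):
    """With probability eps1 replace a prompt operand; with probability
    eps2 replace the intermediate result. Illustrative sketch only."""
    e1, e2, e3 = chain
    if rng.random() < eps1:
        _, rest = e1.split("*", 1)
        e1 = f"{rng.randrange(n)}*{rest}"   # corrupt a prompt operand
    if rng.random() < eps2:
        _, c = e2.split("-", 1)
        e2 = f"{rng.randrange(n)}-{c}"      # corrupt the intermediate result
    return (e1, e2, e3)
```

Note that the final answer e₃ is left untouched, so under ε₂-noise the intermediate step can become inconsistent with both the prompt and the answer, which is exactly the conflict the model must learn to resolve.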
Two notions of faithfulness are formalized. (1) Consistency‑based faithfulness simply checks whether the generated chain (e₁, e₂′, e₃′) exactly matches the ground‑truth chain (e₁, e₂, e₃). Metrics RIR₁ and RIR₂ quantify the proportion of correct reasoning steps and correct solutions, respectively. (2) Intervention‑based faithfulness is stronger: the authors replace the intermediate token e₂ with a random token ẽ₂ and observe how the distribution over final answers changes. Two derived metrics are used: Interventional Distribution Sensitivity (IDS), the average KL divergence between the original and intervened answer distributions, and Interventional Non‑Flip Rate (INR), the fraction of cases where the most likely answer does not change after intervention. Large IDS and small INR indicate genuine step‑by‑step reasoning; the opposite pattern signals “skip‑step” reasoning, where the model essentially ignores the intermediate token and predicts directly from the prompt.
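The two intervention metrics can be computed directly from paired answer distributions. The sketch below assumes a list-of-probabilities interface and is not the authors' implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete answer distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ids_inr(original, intervened):
    """IDS: mean KL divergence between original and intervened answer
    distributions. INR: fraction of cases whose argmax is unchanged."""
    ids = sum(kl_divergence(p, q) for p, q in zip(original, intervened)) / len(original)
    same = sum(p.index(max(p)) == q.index(max(q)) for p, q in zip(original, intervened))
    return ids, same / len(original)
```

A stepwise model shifts its answer distribution when the intermediate token changes (high IDS, low INR); a skip-step model barely reacts (low IDS, high INR).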
A third metric, Prediction Entropy (PE), measures the model’s uncertainty over the final answer given the generated intermediate token. PE is tracked throughout training to reveal dynamic changes in uncertainty.
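PE is just the Shannon entropy of the final-answer distribution conditioned on the generated intermediate token; a minimal version:

```python
import math

def prediction_entropy(answer_dist):
    """Shannon entropy (in nats) of the final-answer distribution;
    high PE means the model is uncertain given the intermediate token."""
    return -sum(p * math.log(p) for p in answer_dist if p > 0)
```

A confident one-hot prediction gives PE = 0, while a uniform distribution over N answers gives the maximum, log N.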
Key Findings
- Noise Threshold for Faithfulness – When ε₂ exceeds roughly 0.15–0.20, both consistency metrics deteriorate sharply, while ε₁ has a much weaker effect. This demonstrates a clear noise‑induced phase transition: low‑noise regimes allow the model to learn faithful, stepwise reasoning; higher noise forces the model to abandon the intermediate step.
- Four Training Phases – The authors identify four distinct phases in the learning dynamics, visualized by PE and the faithfulness metrics:
  - P₀ – the model learns the overall token format;
  - P₁ – genuine step‑by‑step reasoning emerges (low PE, high IDS, low INR);
  - P₂ – a mixed regime in which PE spikes temporarily, IDS rises, and INR falls, indicating that the model is grappling with conflicting information and beginning to encode internal uncertainty;
  - P₃ – skip‑step reasoning dominates (high PE, low IDS, high INR).
  The transient entropy peak in P₂ suggests the emergence of an implicit self‑verification process: the model detects inconsistency between the prompt and the corrupted intermediate token and adjusts its answer distribution accordingly.
- Simplicity Bias as Theoretical Explanation – The authors argue that the observed transition can be understood through the lens of algorithmic simplicity bias. In low‑noise settings, the simplest program that fits the data is the explicit composition f₂ ∘ f₁ (stepwise). When noise corrupts the intermediate token, the loss surface flattens, making the shorter "direct mapping" f(e₁) (skip‑step) equally optimal. Because the training objective does not penalize longer programs, the model gravitates toward the simplest (shortest) representation that still achieves low loss.
- Implicit Self‑Verification – The mixed P₂ phase exhibits both high IDS (the model's answer distribution is sensitive to interventions) and a rise in PE, which the authors interpret as the model internally representing uncertainty about the corrupted intermediate step and attempting to reconcile it before committing to a final answer. This behavior emerges without any explicit meta‑learning or self‑reflection prompts, indicating that standard autoregressive training can give rise to a rudimentary form of self‑verification.
- Implications for Real‑World LLMs – Although the experiments use a toy modular‑arithmetic task and tiny transformers, the authors contend that the same mechanisms likely operate in large‑scale LLMs trained on noisy web data. Consequently, evaluating CoT faithfulness should go beyond consistency checks and incorporate intervention‑based tests to detect whether the model truly relies on its generated reasoning.
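The intervention-based test recommended above can be sketched as a simple loop. The `model(tokens) -> distribution` interface is an assumption for illustration, not the paper's API.

```python
import random

def flip_rate(model, examples, n=97, rng=random):
    """Replace the intermediate token with a random residue and count how
    often the model's top answer changes. A near-zero flip rate suggests
    skip-step reasoning; a high flip rate suggests genuine stepwise use
    of the intermediate token. Hypothetical interface for illustration."""
    flips = 0
    for prompt, intermediate in examples:
        p = model(prompt + [intermediate])      # answer dist, original chain
        q = model(prompt + [rng.randrange(n)])  # answer dist, intervened chain
        flips += p.index(max(p)) != q.index(max(q))
    return flips / len(examples)
```

For example, a model that computes its answer directly from the prompt and ignores the intermediate token would score a flip rate near zero, which consistency checks alone would never reveal.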
Broader Impact and Future Work
The paper highlights a concrete risk: an LLM may produce a plausible chain of reasoning that appears consistent but is actually a post‑hoc justification for a directly guessed answer. This undermines trust in model explanations, especially in safety‑critical domains. The findings suggest that improving data quality (reducing ε₂‑type noise) and possibly encouraging simplicity‑biased learning (e.g., via curriculum learning or explicit regularization) could promote faithful stepwise reasoning.
Future directions include scaling the synthetic setup to deeper reasoning chains, integrating natural‑language prompts, and testing whether explicit self‑verification objectives amplify the emergent uncertainty‑encoding behavior observed in the mixed phase.
In summary, the study provides a rigorous, controllable framework for dissecting how unfaithful CoT reasoning arises from autoregressive training, identifies a noise‑driven phase transition governed by simplicity bias, and uncovers an implicit self‑verification dynamic manifested as a temporary entropy surge. These insights advance our theoretical understanding of LLM reasoning and offer practical guidance for building more transparent and trustworthy AI systems.