Language Model Circuits Are Sparse in the Neuron Basis

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of roughly 100 MLP neurons is enough to control model behaviour. On the multi-hop city → state → capital task from Lindsey et al. (2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. 'map a city to its state') and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.


💡 Research Summary

The paper challenges the prevailing assumption that individual neurons in large language models (LLMs) are too distributed to serve as a sparse, interpretable basis for circuit tracing. By focusing on the pre‑down‑projection hidden activations of the MLP layers—rather than the post‑non‑linear MLP outputs—the authors demonstrate that these raw activation vectors already constitute a highly sparse feature space, comparable to the representations learned by sparse autoencoders (SAEs).
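The central empirical claim, that pre-down-projection MLP activations are already sparse, can be checked with a simple L0-style measurement. The sketch below is illustrative only: the activation shapes, threshold rule, and toy data are assumptions, not details from the paper.

```python
import numpy as np

def l0_sparsity(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Mean number of units active (above threshold) per token.

    `acts` is assumed to hold pre-down-projection MLP hidden activations,
    shape (tokens, d_mlp)."""
    return float((acts > threshold).sum(axis=-1).mean())

# Toy stand-in for real activations: a gated nonlinearity leaves most
# units at (or near) zero, so the active count per token is small.
rng = np.random.default_rng(0)
pre = rng.normal(size=(128, 4096))
acts = np.where(pre > 1.0, pre, 0.0)   # crude stand-in for a GELU/SwiGLU gate

print(f"mean active units per token: {l0_sparsity(acts):.1f} / {acts.shape[1]}")
```

In a real experiment, `acts` would be collected with a hook on the model's MLP hidden layer over a text corpus, and the same statistic would be computed for SAE feature activations for comparison.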

The authors construct an end‑to‑end pipeline that (1) computes attribution scores for each candidate unit (neuron or SAE feature) using gradient‑based methods, (2) selects the top‑k units to form a candidate circuit, and (3) evaluates the circuit's faithfulness (how well the circuit alone, with all other units mean‑ablated, reproduces the original model's behavior) and completeness (how much performance degrades when the circuit itself is ablated). Two attribution methods are examined: the classic Integrated Gradients (IG) and the more recent Relevance Propagation (RelP). RelP replaces nonlinearities with linear approximations during the backward pass, allowing a single‑pass, low‑variance estimate of each unit's contribution while leaving the original forward computation unchanged.
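The selection and evaluation stages of the pipeline can be sketched as follows. The attribution scores and the `metric` callback (which would run the model with only the masked-in units, mean-ablating the rest) are hypothetical stand-ins; only the top-k selection and the normalised faithfulness score are spelled out.

```python
import numpy as np

def select_circuit(scores: np.ndarray, k: int) -> np.ndarray:
    """Stage 2: keep the k units with the largest |attribution|."""
    keep = np.argsort(-np.abs(scores))[:k]
    mask = np.zeros_like(scores, dtype=bool)
    mask[keep] = True
    return mask

def faithfulness(metric, mask, full_score: float, ablated_score: float) -> float:
    """Stage 3: fraction of the full model's performance the circuit alone
    recovers, normalised between the fully ablated model (0) and the
    unablated model (1)."""
    return (metric(mask) - ablated_score) / (full_score - ablated_score)

# Toy example: four units with fake attributions, and a fake metric where
# each kept unit contributes a fixed amount to task performance.
scores = np.array([0.5, -2.0, 0.1, 3.0])
contrib = np.array([1.0, 2.0, 0.5, 4.0])
metric = lambda mask: float(contrib[mask].sum())

mask = select_circuit(scores, k=2)    # keeps units 3 and 1
full = metric(np.ones(4, dtype=bool))
ablated = metric(np.zeros(4, dtype=bool))
print(faithfulness(metric, mask, full, ablated))  # → 0.8
```

Completeness would be measured with the inverse mask: run `metric(~mask)` and check how far performance falls from `full`.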

Experiments are conducted on Llama 3.1 8B (with supplementary results on Gemma 2 2B and 9B) across two benchmark families. The first is the subject‑verb agreement (SVA) suite from Marks et al. (2025), comprising four templated sub‑tasks (simple, rc, within‑rc, noun‑pp). The second is a multi‑hop reasoning task ("city → state → capital") introduced by Lindsey et al. (2025). For each task, a training set of 300 original–counterfactual pairs is used to discover circuits, and a held‑out set of 40 pairs evaluates them.

Key findings:

  1. MLP activations yield dramatically smaller circuits than MLP outputs. Across all SVA sub‑tasks, achieving > 90 % faithfulness with < 100 neurons is possible, whereas MLP‑output‑based circuits require on the order of 10⁴ units. This 100‑fold reduction closes most of the gap between raw neurons and SAE features.

  2. RelP further narrows the gap. When attribution is computed with RelP, neuron‑based circuits match SAE‑based circuits in sparsity and performance, surpassing IG‑based neuron circuits, which remain slightly less sparse. RelP's single‑backward‑pass efficiency also makes it far more practical for large models.

  3. Circuit interpretability improves. In the city‑state‑capital task, the authors identify two distinct neuron clusters: one that encodes the mapping “city → state” and another that encodes “state → capital”. Manipulating the activation of these clusters (e.g., injecting high activation values) reliably steers the model to produce the correct capital, confirming that individual reasoning steps are localized to small neuron groups.

  4. SAE limitations are highlighted. Learned dictionaries suffer from residual polysemanticity and introduce reconstruction error, which must be carried through the circuit as an additional error node. Consequently, SAE‑based circuits are less faithful to the original model than raw‑neuron circuits, even though they may be slightly more compact.
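The steering intervention from finding 3 amounts to clamping a handful of circuit neurons to a high value during the forward pass. The sketch below illustrates the idea; the layer index, neuron ids, and injected value are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical steering config: layer -> (neuron ids, injected value).
STEERED_NEURONS = {17: ([881, 2042], 8.0)}

def steer(acts: np.ndarray, layer: int) -> np.ndarray:
    """Apply the intervention to MLP hidden activations at one layer."""
    if layer not in STEERED_NEURONS:
        return acts                   # untouched layers pass through
    ids, value = STEERED_NEURONS[layer]
    out = acts.copy()                 # leave the original tensor intact
    out[..., ids] = value             # clamp the 'state -> capital' units
    return out
```

In practice this function would be registered as a forward hook on the model's MLP hidden activations, so that the clamped values propagate through the down-projection and change the predicted capital.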

The paper’s contributions are threefold: (a) empirical evidence that the raw MLP activation basis is intrinsically sparse; (b) a demonstration that RelP provides a fast, accurate attribution method for neuron‑level circuit discovery; and (c) concrete examples of steering model behavior by intervening on tiny neuron subsets, without any additional training or fine‑tuning.

Implications are significant for the interpretability, safety, and controllability of LLMs. Researchers can now extract compact, high‑fidelity circuits directly from pretrained models, reducing computational overhead and eliminating the need to train auxiliary autoencoders or transcoders. This opens pathways for automated debugging (identifying failure‑inducing circuits), targeted safety interventions (silencing harmful circuits), and fine‑grained model steering (activating desired reasoning pathways). Future work may extend the methodology to other open‑weight model families, larger scales, and more complex tasks such as mathematical reasoning or code generation, further solidifying neuron‑level circuit tracing as a cornerstone of trustworthy AI.

