Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.


💡 Research Summary

This paper reframes mechanistic interpretability (MI) as a statistical estimation problem rather than a deterministic circuit‑discovery task. The authors begin by examining causal mediation analysis (CMA), the theoretical backbone of most MI methods. They show that the natural indirect effect (NIE) score for a single edge, computed on a single input‑corruption pair, is not a fixed property but a random variable whose value depends on how the input and the counterfactual perturbation are sampled. Consequently, exact CMA scores exhibit high intrinsic variance even before any aggregation.
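A minimal synthetic illustration of this point (a toy scalar "model", not the paper's setup): if the single-pair NIE is recomputed across freshly sampled (input, corruption) pairs, the result is a distribution of scores with substantial spread, not one fixed number.

```python
import numpy as np

rng = np.random.default_rng(0)

def nie_single_pair(clean_x, corrupt_x, w=1.5):
    # Hypothetical scalar mediator: the "effect" of patching the corrupted
    # activation into the clean run, for a toy tanh component.
    clean_act = np.tanh(w * clean_x)
    corrupt_act = np.tanh(w * corrupt_x)
    return corrupt_act - clean_act

# Resample many (input, corruption) pairs from a joint distribution.
scores = np.array([
    nie_single_pair(rng.normal(), rng.normal())
    for _ in range(10_000)
])

print(f"mean NIE = {scores.mean():+.3f}")
print(f"std  NIE = {scores.std():.3f}")  # large relative to the mean
```

Even in this deliberately simple setting, the standard deviation of the single-pair score dwarfs its mean, mirroring the paper's observation that exact CMA scores are high-variance random variables.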

Because exact CMA is computationally prohibitive, the community relies on fast approximations such as Edge Attribution Patching (EAP) and its integrated‑gradient variants (EAP‑IG). The paper demonstrates that these approximations introduce additional estimation noise that compounds the base variance of CMA. Four concrete estimators—EAP, EAP‑IG (inputs), EAP‑IG (activations), and a simple clean‑corrupted gradient average—are evaluated, and each is shown to amplify variance to varying degrees.
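The core idea behind EAP-style estimators can be sketched as a first-order approximation (toy numpy setup, not the paper's implementation): the effect of patching an edge activation is approximated by (a_corrupt − a_clean) · ∂L/∂a, so one backward pass scores every edge at once instead of one forward pass per edge.

```python
import numpy as np

rng = np.random.default_rng(1)

def metric(acts, w):
    return np.tanh(acts @ w)                   # toy nonlinear downstream metric

def metric_grad(acts, w):
    return (1 - np.tanh(acts @ w) ** 2) * w    # analytic dL/da

n_edges = 5
w = rng.normal(size=n_edges)
a_clean = rng.normal(size=n_edges)
a_corrupt = rng.normal(size=n_edges)

# Exact single-edge patching: re-run the metric with one activation swapped.
exact = np.empty(n_edges)
for e in range(n_edges):
    patched = a_clean.copy()
    patched[e] = a_corrupt[e]
    exact[e] = metric(patched, w) - metric(a_clean, w)

# EAP-style linearization: all edge scores from one gradient evaluation.
eap = (a_corrupt - a_clean) * metric_grad(a_clean, w)

print("exact:", np.round(exact, 3))
print("EAP  :", np.round(eap, 3))
```

The gap between `exact` and `eap` is the approximation error the paper refers to: it is nonzero whenever the metric is nonlinear in the activations, and it stacks on top of the intrinsic CMA variance.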

The authors then treat circuit discovery as a downstream statistical pipeline. Global importance μ_e is defined as the expectation of the local NIE scores over the joint distribution of inputs and perturbations. In practice, μ_e is estimated from a finite dataset, and a selection function A (parameterized by sparsity thresholds, connectivity constraints, etc.) extracts a subgraph C. The paper proves that small fluctuations in the estimated scores μ̂_e can be dramatically magnified by the selection step, leading to structurally divergent circuits. Empirical evidence on GPT‑2‑small for the “IOI” task shows that varying multiple hyper‑parameters simultaneously yields a wide spread of circuits, with low Jaccard similarity and no clear clustering in an MDS projection.
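The amplification effect of the selection step can be demonstrated on synthetic scores (hypothetical data, not the GPT‑2 results): a hard top-k selection A applied to noisy estimates μ̂_e produces circuits whose pairwise Jaccard similarity falls well below 1 even for small noise.

```python
import numpy as np

rng = np.random.default_rng(2)

n_edges, k = 200, 20
mu = np.sort(rng.exponential(size=n_edges))[::-1]   # true edge importances

def select_circuit(noise_scale=0.2):
    # mu_hat = mu + finite-sample estimation noise; A = "keep top-k edges".
    mu_hat = mu + rng.normal(scale=noise_scale, size=n_edges)
    return set(np.argsort(mu_hat)[-k:])

def jaccard(a, b):
    return len(a & b) / len(a | b)

runs = [select_circuit() for _ in range(20)]
sims = [jaccard(runs[i], runs[j])
        for i in range(len(runs)) for j in range(i + 1, len(runs))]
print(f"mean pairwise Jaccard over 20 runs: {np.mean(sims):.2f}")
```

Because many edges sit near the selection threshold, tiny score fluctuations flip their membership, which is exactly the mechanism behind the low Jaccard similarity reported for the IOI circuits.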

To dissect the sources of instability, the authors isolate three categories: (i) sampling variance, measured via bootstrap resampling of the dataset; (ii) distributional shifts, examined through paraphrased prompts and alternative counterfactual generation strategies; and (iii) methodological sensitivity, captured by changes in hyper‑parameters and heuristics. Each factor alone produces noticeable variance, and their combination leads to the pronounced instability observed in the final circuits.
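Category (i), sampling variance, can be sketched with a standard bootstrap (synthetic per-example scores stand in for real NIE values): resample the finite dataset with replacement and observe how much the aggregate estimate μ̂_e moves.

```python
import numpy as np

rng = np.random.default_rng(3)

# Per-example scores for a single hypothetical edge.
per_example_scores = rng.normal(loc=0.3, scale=1.0, size=128)

# Bootstrap: resample the dataset with replacement and re-aggregate.
boot_means = np.array([
    rng.choice(per_example_scores, size=128, replace=True).mean()
    for _ in range(2_000)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mu_hat = {per_example_scores.mean():+.3f}, "
      f"95% bootstrap CI = [{lo:+.3f}, {hi:+.3f}]")
```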

The paper concludes with a set of best‑practice recommendations aimed at turning MI into a rigorous scientific discipline. These include systematic bootstrap‑based variance estimation, reporting confidence intervals for importance scores, cross‑validating circuits across multiple perturbation schemes, and incorporating uncertainty into the circuit‑selection process (e.g., Bayesian model averaging or ensemble methods). By adopting these practices, researchers can quantify and communicate the reliability of discovered mechanisms, moving beyond point estimates toward statistically robust explanations. The work thus provides the first comprehensive variance analysis of MI pipelines and a roadmap for more reproducible mechanistic interpretability research.
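One of the recommended remedies, incorporating uncertainty into circuit selection via ensembling, can be sketched as a majority-vote stability filter over bootstrap resamples (synthetic scores; the split into 10 "signal" edges and 90 noise edges is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)

n_edges, n_examples, k = 100, 64, 10
# Synthetic per-example scores: 10 genuinely important edges, 90 noise.
signal = np.concatenate([np.full(10, 1.0), np.zeros(90)])
data = signal + rng.normal(scale=1.0, size=(n_examples, n_edges))

n_boot = 200
counts = np.zeros(n_edges)
for _ in range(n_boot):
    idx = rng.integers(0, n_examples, size=n_examples)   # bootstrap resample
    mu_hat = data[idx].mean(axis=0)
    counts[np.argsort(mu_hat)[-k:]] += 1                 # top-k per resample

# Keep only edges selected in a majority of bootstrap circuits.
stable_circuit = np.flatnonzero(counts / n_boot > 0.5)
print("edges kept by the majority-vote filter:", stable_circuit)
```

Rather than a single point-estimate circuit, this reports which edges survive resampling, a direct, if simplistic, instance of the "incorporate uncertainty into selection" recommendation.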

