Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs


Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.


💡 Research Summary

The paper tackles the longstanding instability of self‑interpretation methods that ask large language models (LLMs) to describe their own internal activations in natural language. Prior approaches are highly sensitive to a scaling hyper‑parameter, often producing fluent but semantically empty explanations. Rather than fine‑tuning the entire model—risking that explanations reflect the fine‑tuned weights rather than the original representations—the authors freeze the LLM completely and train only a tiny adapter that maps activation vectors to token embeddings. The training data consist of “interpretability artifacts”: pairs of vectors (either sparse auto‑encoder decoder vectors or contrastive activation vectors) and human‑readable labels derived from prior mechanistic interpretability work.
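The training data described above can be pictured as simple (vector, label) records. The sketch below shows one plausible shape for such a record; the class and field names are illustrative assumptions, not identifiers from the paper's released code.

```python
from dataclasses import dataclass

@dataclass
class VectorLabelPair:
    """One interpretability artifact used as a supervised training example."""
    vector: list   # SAE decoder vector or contrastive activation vector
    label: str     # human-readable description from prior interpretability work
    source: str    # provenance, e.g. "sae" or "contrastive" (assumed tag)

# Example record (values are made up for illustration):
pair = VectorLabelPair(vector=[0.1, -0.2, 0.3],
                       label="mentions of European capital cities",
                       source="sae")
```

The adapter is then trained so that, when `vector` is injected into the frozen LM's input stream, the model generates text matching `label`.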

Four adapter families are explored, ranging from the trivial identity to full‑rank affine transforms. The central finding is that a scalar affine adapter, f(h)=α·h+b, with only d+1 parameters (where d is the model’s hidden dimension), is sufficient to achieve dramatic gains. The bias term alone accounts for roughly 85% of the improvement over the untrained baseline. Low‑rank extensions add modest benefits, while full‑rank adapters catastrophically overfit on sparse‑auto‑encoder data, essentially learning a high‑dimensional lookup table that destroys the geometric structure of the activation space.
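The scalar affine adapter is small enough to write out in full. This is a minimal dependency-free sketch of f(h) = α·h + b; the class name and initialization are assumptions for illustration (the paper's actual implementation and init scheme may differ).

```python
class ScalarAffineAdapter:
    """Maps a hidden-state vector h to alpha * h + b before it is
    injected into the frozen LM's embedding stream."""

    def __init__(self, d_model: int):
        self.alpha = 1.0             # one learned scalar scale
        self.b = [0.0] * d_model     # learned bias vector
        # Total trainable parameters: d_model + 1

    def __call__(self, h):
        return [self.alpha * x + b_i for x, b_i in zip(h, self.b)]

    def num_parameters(self) -> int:
        return 1 + len(self.b)

adapter = ScalarAffineAdapter(d_model=4)
out = adapter([1.0, 2.0, 3.0, 4.0])  # identity map at this initialization
```

Note that at α = 1, b = 0 the adapter is the identity, so training only has to learn a small perturbation of the activation space, which is consistent with the finding that simpler adapters generalize better.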

Experiments span three datasets: (1) Goodfire SAE features (≈45k vectors) with auto‑interpretability labels, (2) Llama‑Scope SAE features (multiple widths) with labels from Neuronpedia, and (3) Wikipedia contrastive vectors (≈50k topics) paired with synthetic descriptions. On SAE data, the scalar affine adapter brings validation cross‑entropy down to 1.787 and raises generation‑scoring hit rates from 63% to 71% at the 70B scale. On contrastive vectors, a full‑rank adapter reaches 82.9% recall@1 (vs. 0.04% for the untrained SelfIE baseline) and 98.4% recall@100. A “best‑of‑6” protocol pushes recall@1 to 93.7%.
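The recall@k numbers above follow the standard definition: for each activation vector, candidate topic labels are ranked by how well they match the generated description, and recall@k is the fraction of queries whose true label lands in the top k. A minimal sketch of that metric (the ranking procedure itself is abstracted away here):

```python
def recall_at_k(rankings, true_labels, k):
    """rankings: per-query lists of candidate labels, best first.
    true_labels: the gold label for each query.
    Returns the fraction of queries whose gold label is in the top k."""
    hits = sum(1 for ranked, truth in zip(rankings, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)

# Toy example with two queries and two candidate labels each:
rankings = [["paris", "rome"], ["rome", "paris"]]
truths = ["paris", "paris"]
r1 = recall_at_k(rankings, truths, 1)    # only the first query hits
r2 = recall_at_k(rankings, truths, 2)    # both hit within the top 2
```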

A scaling analysis using the Qwen‑2.5 family (7B → 72B) shows that self‑interpretation performance improves faster than a “Taboo” baseline where the model describes topics without naming them. While the Taboo ceiling saturates early, the trained adapters continue to climb, indicating that larger models contain richer latent semantics that become accessible through the lightweight adapter.

Beyond pure labeling, the adapters enable novel capabilities: in multi‑hop reasoning tasks they can surface “bridge entities” that never appear in the prompt or answer, effectively revealing implicit reasoning steps without requiring chain‑of‑thought prompting. Moreover, the adapters dramatically reduce scale‑sensitivity; after training, a simple grid of six fixed scales suffices, whereas prior methods required per‑latent scale tuning.
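The reduced scale‑sensitivity can be sketched concretely: rather than tuning a scale per latent, the trained adapter generates one description at each of six fixed scales and keeps the best‑scoring candidate. The particular grid values and the scorer below are illustrative assumptions, not the paper's exact settings.

```python
# Six fixed scales tried for every latent (grid values are assumed).
SCALE_GRID = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

def best_of_6(vector, generate, score):
    """generate(scaled_vector) -> description string (or any candidate);
    score(description) -> float quality estimate.
    Returns the highest-scoring description over the fixed scale grid."""
    candidates = [generate([s * x for x in vector]) for s in SCALE_GRID]
    return max(candidates, key=score)

# Toy stand-ins: "generation" sums the vector, "scoring" returns it as-is,
# so the largest scale wins.
best = best_of_6([1.0], generate=lambda v: sum(v), score=lambda d: d)
```

This replaces the per‑latent hyperparameter search of prior methods with a constant six generations per query.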

The authors release code and pretrained adapters, and demonstrate cross‑family generalization (e.g., adapters trained on Wikipedia vectors perform reasonably on SAE latents). Simpler adapters consistently generalize better than more expressive ones, underscoring the importance of preserving the identity structure of the activation space.

In summary, the work introduces a paradigm shift: interpretability artifacts are repurposed as supervision to teach frozen LLMs to interpret themselves. With only d + 1 trainable parameters, the method yields reliable, scalable self‑interpretation, outperforms prior prompting‑only baselines, and scales positively with model size—all without altering the underlying model. This opens a practical pathway for integrating mechanistic insights directly into model‑generated explanations.

