Spectral Superposition: A Theory of Feature Geometry

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Neural networks represent more features than they have dimensions via superposition, forcing features to share representational space. Current methods decompose activations into sparse linear features but discard geometric structure. We develop a theory for studying the geometric structure of features by analyzing the spectra (eigenvalues, eigenspaces, etc.) of weight-derived matrices. In particular, we introduce the frame operator $F = WW^\top$, which gives us a spectral measure that describes how each feature allocates norm across eigenspaces. While previous tools could describe the pairwise interactions between features, spectral methods capture the global geometry (“how do all features interact?”). In toy models of superposition, we use this theory to prove that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification via association schemes, classifying all geometries from prior work (simplices, polygons, antiprisms). The spectral measure formalism applies to arbitrary weight matrices, enabling diagnosis of feature localization beyond toy settings. These results point toward a broader program: applying operator theory to interpretability.


💡 Research Summary

The paper tackles the pervasive phenomenon of superposition in neural networks, where far more latent concepts (features) are stored than there are dimensions in the activation space. Traditional interpretability tools such as Sparse Autoencoders (SAEs) decompose activations into sparse linear vectors, but they ignore the geometric relationships that arise when multiple features share the same subspace. To address this gap, the authors introduce a spectral framework built around the frame operator F = WWᵀ, where W ∈ ℝ^{d×f} is the weight matrix of a layer. Unlike the Gram matrix M = WᵀW, which changes under column permutations, F remains invariant, making it a natural object for basis‑independent analysis of feature geometry.
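The basis-independence claim can be checked numerically. A minimal sketch (using NumPy with random weights rather than a trained model): permuting the columns of W merely relabels the features, which permutes the Gram matrix along both axes but leaves the frame operator untouched.

```python
import numpy as np

# Toy check: the frame operator F = W W^T is invariant under relabeling
# (permuting) the feature columns of W, while the Gram matrix M = W^T W
# has its rows and columns permuted accordingly.
rng = np.random.default_rng(0)
d, f = 4, 9                      # f > d: more features than dimensions
W = rng.normal(size=(d, f))      # columns are feature directions

F = W @ W.T                      # frame operator, d x d
M = W.T @ W                      # Gram matrix, f x f

perm = rng.permutation(f)
Wp = W[:, perm]                  # relabel the features

print(np.allclose(Wp @ Wp.T, F))                      # True: F unchanged
print(np.allclose(Wp.T @ Wp, M[np.ix_(perm, perm)]))  # True: M permuted
```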

The core idea is to examine the eigenvalues (λₑ) and eigenspaces (Eₑ) of F. Each eigenvalue quantifies how much of the feature set’s total norm is allocated to its corresponding eigenspace, while the eigenspace itself reveals the directions along which features interfere. The authors define “spectral localization” – the tendency of features to concentrate their energy in a single eigenspace – and prove that when the model approaches capacity saturation (f ≫ d), spectral localization becomes inevitable. In this regime, the collection of features forms a tight frame: the projection operator onto each eigenspace is P_C = λ_C ∑_{i∈Ω_C} W_i W_iᵀ, where Ω_C denotes the set of features that share the eigenspace U_C and λ_C = dim(U_C)/|Ω_C| is the fractional dimensionality.
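The spectral measure can be made concrete with the classic tight-frame example of three unit features at 120° in a 2-dimensional space. A hedged sketch (the name `mu` is illustrative, not the paper’s notation):

```python
import numpy as np

# Three unit features at 120 degrees in d = 2: a tight frame.
f, d = 3, 2
angles = 2 * np.pi * np.arange(f) / f
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (d, f)

F = W @ W.T
eigvals, eigvecs = np.linalg.eigh(F)

# Spectral measure of feature i: fraction of ||W_i||^2 along each eigendirection.
mu = (eigvecs.T @ W) ** 2 / (W ** 2).sum(axis=0)

print(eigvals)         # both ≈ f/d = 1.5: F = (f/d) I, the tight-frame condition
print(mu.sum(axis=0))  # each column sums to 1: a probability measure per feature
```

Here the fractional dimensionality is λ_C = dim(U_C)/|Ω_C| = 2/3, and the frame operator’s single eigenvalue is its reciprocal, 3/2.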

To connect these spectral observations with combinatorial structure, the paper leverages association scheme theory. Using a toy example of two independent clusters—rock‑paper‑scissors (RPS) with D₃ symmetry and a heads/tails (HT) pair with C₂ symmetry—the authors construct the Bose‑Mesner algebra of each cluster. The adjacency matrices of the clusters decompose into a centroid subspace and a difference subspace, which correspond precisely to the eigenprojectors of the Gram matrix and, via the spectral correspondence lemma, to those of the frame operator. This demonstrates that the eigenstructure of F encodes the same symmetry information that association schemes capture, allowing a unified classification of all previously identified geometries (simplex, polygon, antiprism) under a single spectral signature.
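The RPS cluster’s combinatorics can be sketched directly (a minimal illustration under the standard 3-point scheme, not the paper’s full construction): its Bose‑Mesner algebra is spanned by A₀ = I and A₁ = J − I, and the common eigenprojectors split ℝ³ into the centroid and difference subspaces that diagonalize the cluster’s Gram matrix.

```python
import numpy as np

# Bose-Mesner algebra of the 3-point association scheme (illustrative sketch).
n = 3
I, J = np.eye(n), np.ones((n, n))
A0, A1 = I, J - I                 # adjacency matrices: "same" and "different"

P_centroid = J / n                # projector onto span{(1,1,1)}
P_diff = I - J / n                # projector onto the 2D difference subspace

# Gram matrix of three unit features at 120 degrees: pairwise overlap -1/2.
M = A0 - 0.5 * A1

# M diagonalizes in the scheme's eigenprojectors: M = 0*P_centroid + 1.5*P_diff,
# matching the frame operator's single eigenvalue 3/2 on the feature plane.
print(np.allclose(M, 1.5 * P_diff))       # True
print(np.allclose(A1 @ P_diff, -P_diff))  # True: A1 acts as -1 on the difference subspace
```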

Crucially, the spectral bridge between M and F holds for any weight matrix, not just the toy models. By performing an eigen‑decomposition of F, one can recover the geometry of feature interactions in real networks: features that share an eigenspace can interfere, while those residing in orthogonal eigenspaces are effectively independent. The authors propose quantitative diagnostics such as eigenvalue dispersion, eigenspace participation ratios, and the fractional dimensionality λ_C to measure interference strength, capacity consumption, and the degree of localization. These metrics can guide model compression, regularization, and the design of interventions that minimize unintended side effects when editing concepts.
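A hedged sketch of diagnostics in this spirit (the function name and exact formulas are illustrative choices, not necessarily the paper’s definitions):

```python
import numpy as np

def spectral_diagnostics(W, tol=1e-12):
    """Illustrative diagnostics computed from the frame operator's spectrum."""
    lam = np.linalg.eigvalsh(W @ W.T)
    lam = lam[lam > tol]                 # keep only occupied eigendirections
    p = lam / lam.sum()
    dispersion = lam.std() / lam.mean()  # 0 when all eigenvalues are equal (tight frame)
    participation = 1.0 / np.sum(p**2)   # effective number of occupied eigendirections
    return dispersion, participation

# Five unit features equally spaced on a circle: a tight frame in d = 2.
angles = 2 * np.pi * np.arange(5) / 5
W_tight = np.stack([np.cos(angles), np.sin(angles)])
disp, part = spectral_diagnostics(W_tight)
print(disp, part)   # ≈ 0.0 and ≈ 2.0: no dispersion, both dimensions fully used
```

Large dispersion or a participation ratio well below d would flag uneven capacity use; a value near d with low dispersion indicates a tight-frame-like configuration.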

In summary, the paper establishes a rigorous operator‑theoretic foundation for interpreting superposition. It shows that the frame operator’s spectrum provides a basis‑invariant, globally aware description of feature geometry, that capacity limits force features into tight frames localized on single eigenspaces, and that association‑scheme algebra offers a complete taxonomy of possible geometric configurations. The framework opens several avenues for future work: applying spectral diagnostics to large‑scale transformers, developing spectral regularizers to control interference, and extending association‑scheme analysis to more complex relational structures in deep models.

