Bridging Functional and Representational Similarity via Usable Information
We present a unified framework for quantifying the similarity between representations through the lens of usable information, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we prove that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a low-capacity observer may look identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.
💡 Research Summary
This paper proposes a unified theoretical framework that bridges two traditionally separate notions of similarity in deep neural networks—functional similarity and representational similarity—by grounding both in the concept of usable information. Usable information, as defined by Xu et al. (2020), measures how much of the statistical dependence between a representation Z and a target Y can actually be exploited by a downstream predictor belonging to a restricted hypothesis class V. Formally, the V‑conditional entropy H_V(Y|Z) is the minimal cross‑entropy loss achievable by any predictor in V, and usable information is I_V(Z→Y)=H_V(Y|∅)−H_V(Y|Z). This formulation makes the information measure directional and dependent on the capacity of V (e.g., linear, shallow, or deep neural networks).
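As a toy illustration of this definition (not the paper's implementation), the sketch below estimates I_V(Z→Y) in nats for V = linear (logistic) predictors: H_V(Y|∅) is the cross-entropy of the best constant predictor, i.e., the empirical label marginal, and H_V(Y|Z) is the cross-entropy achieved by a fitted logistic regression. The helper name `usable_information` and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def usable_information(Z, y, seed=0):
    """Estimate I_V(Z -> Y) in nats for V = linear (logistic) predictors."""
    # H_V(Y | empty set): cross-entropy of the marginal label distribution
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    h_prior = -np.sum(p * np.log(p))

    # H_V(Y | Z): cross-entropy achieved by the best linear predictor in V
    clf = LogisticRegression(max_iter=1000, random_state=seed).fit(Z, y)
    h_cond = log_loss(y, clf.predict_proba(Z), labels=classes)

    return h_prior - h_cond

# Toy data: the label is linearly decodable from the first coordinate of Z.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
Z = rng.normal(size=(500, 4))
Z[:, 0] += 3.0 * y

print(usable_information(Z, y))  # close to H(Y) ≈ ln 2 for a balanced Y
```

Swapping the logistic regression for a deeper model changes V and can only increase the estimate, which is exactly the capacity-dependence the paper emphasizes.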
Functional similarity is defined via a Markov blanket condition: two representations Z₁ and Z₂ are functionally similar with respect to a task Y if I(Z₂;Y|Z₁)=I(Z₁;Y|Z₂)=0. The authors connect this definition to model stitching, a practical technique where a learned “stitcher” s maps Z₁ into the space of Z₂ so that a fixed head q_φ₂ can predict Y with the same cross‑entropy loss as when using Z₂ directly. Proposition 3.5 and Corollary 3.6 prove that, when q_φ₂ is Bayes‑optimal for Z₂, perfect stitchability of Z₁ into Z₂ is equivalent to Z₁ being a Markov blanket for Y relative to Z₂. Consequently, functional similarity requires two well‑performing stitchers (Z₁→Z₂ and Z₂→Z₁); a single directional stitcher only establishes a one‑way information flow and does not guarantee full functional equivalence.
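The bidirectional requirement can be sketched with a minimal linear stitcher, under the assumption of least-squares stitchers and frozen logistic heads (the helper `stitch_loss` and the synthetic representations are hypothetical, not the paper's setup): fit s from one representation into the other, score the target's frozen head on the stitched features, and repeat in the opposite direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def stitch_loss(Z_src, Z_tgt, y, head):
    """Fit a linear stitcher s: Z_src -> Z_tgt by least squares, then score
    the frozen head q on s(Z_src). Low loss = Z_src stitches into Z_tgt."""
    W, *_ = np.linalg.lstsq(Z_src, Z_tgt, rcond=None)
    return log_loss(y, head.predict_proba(Z_src @ W), labels=head.classes_)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)
signal = rng.normal(size=(400, 3)) + y[:, None]
Z1 = signal @ rng.normal(size=(3, 8))                # linear map of the signal
Z2 = np.hstack([signal, rng.normal(size=(400, 5))])  # signal + nuisance dims

head1 = LogisticRegression(max_iter=1000).fit(Z1, y)  # frozen head for Z1
head2 = LogisticRegression(max_iter=1000).fit(Z2, y)  # frozen head for Z2

# Bidirectional check: both directions should match the heads' native losses.
print(stitch_loss(Z1, Z2, y, head2), log_loss(y, head2.predict_proba(Z2)))
print(stitch_loss(Z2, Z1, y, head1), log_loss(y, head1.predict_proba(Z1)))
```

Here both directions succeed because Z₁ and Z₂ carry the same task-relevant signal; a representation that only received information from the other would pass one direction and fail the reverse, which is the asymmetry the summary warns about.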
Representational similarity is defined analogously but with the input X as the target: Z₁ and Z₂ are representationally similar if I(X;Z₂|Z₁)=I(X;Z₁|Z₂)=0, meaning they retain exactly the same information about the raw input. The paper shows that many widely used similarity metrics—Centered Kernel Alignment (CKA), Representational Similarity Analysis (RSA), and Procrustes alignment—are in fact estimators of usable information under specific choices of V. For example, CKA corresponds to V being the class of linear (or kernel‑induced) predictors, RSA to rank‑based similarity under isotropic transformations, and Procrustes to orthogonal linear maps. Thus, these metrics are not arbitrary geometric distances but reflect the amount of information that can be extracted by a predictor of a given expressive power.
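In the linear case, CKA has a compact closed form. The sketch below (`linear_cka` is an illustrative helper, not code from the paper) checks its invariance to orthogonal transformations and isotropic scaling, which is why it behaves like a measure of linearly extractable information rather than a raw geometric distance.

```python
import numpy as np

def linear_cka(Z1, Z2):
    """Linear CKA between two representation matrices (samples x features),
    invariant to orthogonal transforms and isotropic scaling of either side."""
    Z1 = Z1 - Z1.mean(axis=0)
    Z2 = Z2 - Z2.mean(axis=0)
    hsic = np.linalg.norm(Z1.T @ Z2, "fro") ** 2
    return hsic / (np.linalg.norm(Z1.T @ Z1, "fro") * np.linalg.norm(Z2.T @ Z2, "fro"))

rng = np.random.default_rng(2)
Z = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal map

print(linear_cka(Z, 2.5 * (Z @ Q)))             # 1.0: rotation + scale preserved
print(linear_cka(Z, rng.normal(size=(200, 16))))  # near 0 for independent data
```

A rotated, rescaled copy of Z scores 1 because a linear predictor can extract exactly the same information from it; an independent representation scores near 0.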
A central theoretical contribution is the hierarchy of similarity based on task granularity. Proposition 3.8 demonstrates monotonicity: if two representations are functionally similar for a complex task Y, they remain functionally similar for any coarser deterministic function Y′=g(Y). Since input reconstruction (Y=X) is the finest possible task, representational similarity is a special case of functional similarity (Remark 3.9) and, by Corollary 3.10, guarantees functional similarity for all deterministic downstream tasks. The implication is strict: functional similarity does not imply representational similarity because a model may discard task‑irrelevant nuisance information while still preserving the required predictive signal.
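The monotonicity claim echoes the data-processing inequality: deterministically coarsening Y can never increase the information a representation carries about it. A toy plug-in check (the helper `mutual_info` and the sampling scheme are illustrative, not the paper's experiment):

```python
import math
import random
from collections import Counter

def mutual_info(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Fine-grained task: a 4-way label Y; Z is a noisy copy of Y.
random.seed(0)
y = [random.randrange(4) for _ in range(2000)]
z = [yi if random.random() < 0.8 else random.randrange(4) for yi in y]

def g(yi):
    return yi % 2  # deterministic coarsening Y' = g(Y)

print(mutual_info(z, y), mutual_info(z, [g(yi) for yi in y]))
# I(Z; g(Y)) never exceeds I(Z; Y) -- the coarse task needs less information
```

Since input reconstruction (Y = X) sits at the top of this hierarchy, matching on X forces matching on every deterministic downstream task, while the converse fails whenever nuisance information about X has been discarded.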
Empirically, the authors evaluate several convolutional and transformer architectures on three axes: (i) bidirectional stitching performance, (ii) CKA/RSA scores, and (iii) reconstruction loss. The experiments confirm that (a) bidirectional stitchability aligns with high functional similarity, (b) under a linear predictor family, CKA correlates strongly with stitching performance, and (c) when V includes deep nonlinear models, additional usable information is uncovered that standard geometric metrics miss. These findings validate the theoretical claims and illustrate how the choice of V controls what aspects of similarity are measured.
Overall, the paper reframes functional and representational similarity as points on a continuum governed by the capacity of the downstream predictor. By interpreting existing metrics through the lens of usable information, it provides a principled basis for comparing models, assessing transferability, and even relating artificial networks to biological neural representations. Future work may explore efficient estimation of I_V for richer hypothesis classes, extend the hierarchy to non‑deterministic tasks, and apply the framework to cross‑modal or neuro‑computational studies.