Decomposing multimodal embedding spaces with group-sparse autoencoders

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn “split dictionaries”, where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.


💡 Research Summary

The paper tackles a fundamental limitation of applying sparse autoencoders (SAEs) to jointly‑aligned multimodal embeddings such as CLIP (image‑text) and CLAP (audio‑text). While SAEs have become a popular tool for uncovering human‑interpretable “concepts” in single‑modality embeddings, recent work has shown that when trained on multimodal spaces they tend to learn a “split dictionary”: most dictionary vectors are active only for one modality, even though the underlying embeddings are aligned. This split‑dictionary phenomenon hampers cross‑modal manipulation, retrieval, and interpretability because the same semantic information is represented by disjoint latent codes in different modalities.

The authors first provide a theoretical argument: if a split dictionary exists on an aligned space, then there also exists a non‑split dictionary that achieves the same reconstruction loss but with strictly better multimodal alignment. This shows that the problem is not a failure of the Linear Representation Hypothesis (LRH) itself, but rather an implicit bias of standard SAE training, which optimizes only reconstruction error without any structural constraints linking modalities.
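
To convey the flavor of this argument, here is a schematic of the exact-alignment case (our reconstruction for illustration only, not the paper's proof, which must also handle approximate alignment and the sparsity bookkeeping):

```latex
% Suppose alignment is exact, x := x_img = x_txt, and a split dictionary
% gives modality-specific decompositions
%     x = D_img z    and    x = D_txt z' .
% Form the concatenated dictionary and a shared (averaged) code
%     D := [ D_img | D_txt ],    w := (1/2) (z ; z') .
% Then
%     D w = (1/2) (D_img z + D_txt z') = x
% for BOTH modalities: the single code w reconstructs each side with the
% same loss, every active atom fires for both modalities, and the two
% modalities' codes coincide, i.e. alignment is perfect.
```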

To overcome this bias, the paper introduces a novel SAE architecture that combines two key mechanisms:

  1. Cross‑modal random masking – For each paired sample (e.g., an image and its caption), a random subset of dimensions in one modality’s embedding is masked before encoding. This forces the encoder to rely on features that are robust across modalities, encouraging the same dictionary atoms to be used for both sides of the pair.

  2. Group‑sparse regularization – For each dictionary atom, its activations on the two modalities of a paired sample are treated as a single group, and an ℓ₂,₁ norm (the sum of the per‑group ℓ₂ norms) is added to the loss. Because a concept encoded by one shared atom incurs less penalty than the same activation energy split across two modality‑specific atoms, this promotes simultaneous activation of the same dictionary atoms across modalities, effectively tying the latent representations together.
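
These two mechanisms can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the authors' code; the function names, the dimension-wise masking granularity, and the exact penalty form are our assumptions:

```python
import numpy as np

def cross_modal_mask(x, p=0.5, rng=None):
    """Zero out a random fraction p of embedding dimensions before encoding.

    The encoder must recover the code from a partial view, which discourages
    features that depend on modality-specific dimensions.
    """
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape[-1]) >= p
    return x * keep

def group_l21_penalty(z_a, z_b):
    """l2,1 penalty over paired codes: sum_j ||(z_a[j], z_b[j])||_2.

    Each dictionary atom j forms one group holding its activation in both
    modalities, so a concept encoded by a shared atom is cheaper than the
    same energy split across two modality-specific atoms.
    """
    stacked = np.stack([z_a, z_b])                # shape (2, K)
    return np.linalg.norm(stacked, axis=0).sum()  # per-atom l2, then sum
```

For intuition: with unit activations, a shared atom costs ||(1, 1)||₂ = √2 ≈ 1.41, while two split atoms cost ||(1, 0)||₂ + ||(0, 1)||₂ = 2, so the penalty favors reusing the same atom across modalities.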

The overall loss therefore consists of (i) reconstruction L₂ loss, (ii) a sparsity‑promoting term (e.g., TopK or ℓ₁), (iii) the group‑sparse ℓ₂,₁ penalty, and (iv) a consistency term induced by the random masking. The encoder still uses a linear projection followed by a sparsifying function (TopK in the experiments), and the decoder is a linear reconstruction matrix (the dictionary).
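
Under those assumptions, the loss for a single pair might look like the following sketch (hypothetical shapes and hyperparameters; the TopK cut and squared-error terms follow the description above, with sparsity enforced structurally by TopK rather than an explicit ℓ₁ term):

```python
import numpy as np

def topk(z, k):
    """Keep the k largest activations and zero the rest (TopK sparsifier)."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out

def gs_sae_loss(x_a, x_b, W_enc, b_enc, D, k=8, lam=0.1, mask_p=0.3, rng=None):
    """Reconstruction + group-sparse loss for one cross-modal pair.

    x_a, x_b : aligned embeddings of the two modalities, shape (d,)
    W_enc    : (K, d) encoder weights;  b_enc : (K,) encoder bias
    D        : (d, K) decoder matrix (the dictionary)
    """
    if rng is None:
        rng = np.random.default_rng()
    # (iv) cross-modal random masking: encode each side from a corrupted view
    z_a = topk(W_enc @ (x_a * (rng.random(x_a.shape) >= mask_p)) + b_enc, k)
    z_b = topk(W_enc @ (x_b * (rng.random(x_b.shape) >= mask_p)) + b_enc, k)
    # (i) reconstruct the *unmasked* embeddings from the sparse codes
    recon = ((x_a - D @ z_a) ** 2).sum() + ((x_b - D @ z_b) ** 2).sum()
    # (ii) sparsity comes from topk; (iii) group l2,1 penalty ties the pair
    group = np.linalg.norm(np.stack([z_a, z_b]), axis=0).sum()
    return recon + lam * group
```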

The authors evaluate the method on two large‑scale multimodal models: CLIP (image‑text) and CLAP (audio‑text). They introduce a new quantitative metric, Multimodal Monosemanticity Score (MMS), which measures how often a single neuron co‑activates for semantically similar inputs from different modalities, using cosine similarity from a separate multimodal encoder as a proxy for semantic similarity. Additional metrics include the proportion of dead neurons, zero‑shot cross‑modal retrieval accuracy, and human judgments of concept interpretability.
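
As a rough illustration of the kind of statistic such a metric computes, a per-atom cross-modal co-activation score could be sketched as follows (a simplified stand-in, not the paper's definition; the thresholding and similarity weighting are our assumptions):

```python
import numpy as np

def cross_modal_score(Z_img, Z_txt, sem_sim, thresh=0.0):
    """Per-atom proxy for multimodal monosemanticity.

    Z_img, Z_txt : (N, K) sparse codes for N image/text pairs
    sem_sim      : (N,) semantic similarity of each pair, e.g. cosine
                   similarity from a separate multimodal encoder
    Returns a (K,) array: each atom's cross-modal co-firing rate, weighted
    by the mean pair similarity on the samples where it co-fires.
    """
    co = (Z_img > thresh) & (Z_txt > thresh)   # (N, K) co-activation mask
    rate = co.mean(axis=0)                     # how often each atom co-fires
    counts = co.sum(axis=0)
    mean_sim = (co * sem_sim[:, None]).sum(axis=0) / np.maximum(counts, 1)
    return rate * np.where(counts > 0, mean_sim, 0.0)
```

Atoms that fire in only one modality score zero regardless of reconstruction quality, which is exactly the split-dictionary failure mode the paper targets.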

Results show that the proposed Group‑Sparse SAE (GS‑SAE) dramatically improves multimodal alignment: MMS scores rise from near zero (standard SAE) to around 0.7, indicating that many dictionary atoms now fire for both image and text (or audio and text) representations of the same concept. The number of dead neurons drops by more than 30%, and zero‑shot retrieval performance improves by 5–7% absolute over the baseline SAE. Human evaluation confirms that the learned concepts are more readily assignable to intuitive labels (e.g., “dog”, “classical music”) compared with the baseline. Ablation studies reveal that both random masking and group‑sparsity are necessary; removing either component reduces the gains.

Beyond empirical improvements, the paper contributes a rigorous theoretical insight about the existence of better‑aligned dictionaries, a set of novel evaluation tools for multimodal SAEs, and a practical recipe that can be applied to any over‑complete linear autoencoder architecture. The work opens avenues for more controllable multimodal generation, concept‑based editing across modalities, and deeper interpretability of large‑scale vision‑language or audio‑language models. Future directions include scaling to video‑text embeddings, exploring non‑linear encoders, and integrating the learned dictionaries into downstream tasks such as prompt engineering or multimodal reinforcement learning.

