Learning Sparse Visual Representations via Spatial-Semantic Factorization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.


💡 Research Summary

The paper tackles a long‑standing dilemma in self‑supervised visual learning: the “invariance paradox.” Joint‑embedding methods such as DINO enforce invariance to spatial augmentations, which yields strong high‑level semantics but discards the spatial coordinates needed for pixel‑level reconstruction. Conversely, masked image modeling approaches like MAE preserve dense feature grids for reconstruction yet produce weaker semantic representations. The authors argue that this trade‑off is a by‑product of the dense 2‑D grid representation itself.

STELLAR (Sparse Token Extraction and Localization with Low‑rank Approximation Reconstruction) proposes to factorize an image’s latent representation into two low‑rank matrices: a semantic matrix S (size r × d) containing r learnable concept embeddings, and a localization matrix L (size n × r) that assigns each of the n patches a convex combination of the concepts. The full representation is Z = L S, a low‑rank approximation of the conventional n × d dense feature map. By design, spatial transformations affect only L, while S remains approximately invariant. This separation allows the authors to apply DINO‑style augmentation alignment solely on S, preserving semantic invariance, while still using Z for high‑fidelity reconstruction.

To shape the semantic tokens, the method introduces K prototype vectors and projects each token onto the unit sphere. Sinkhorn‑Knopp balancing yields soft assignments q, and a clustering loss L_cluster encourages the tokens to form distinct, balanced semantic clusters. For view‑to‑view alignment, the authors cast matching between the token sets of two augmentations as an optimal‑transport problem, solved efficiently with an entropy‑regularized Sinkhorn algorithm to obtain a matching matrix P. The matched pairs are then aligned with a cross‑entropy loss L_align, enforcing invariance of S across augmentations. An additional KoLeo regularizer maximizes pairwise distances between tokens from the same image, reducing redundancy. The total training objective combines the reconstruction, clustering, alignment, and optional class‑token losses with the KoLeo regularizer, each weighted by a hyper‑parameter.
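Two of these ingredients follow standard recipes and can be sketched as follows. The Sinkhorn-Knopp routine matches the SwAV-style formulation and the KoLeo definition matches the DINOv2-style regularizer; the temperature eps and iteration count are illustrative defaults, not necessarily the paper's hyper-parameters:

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Entropy-regularized Sinkhorn-Knopp: turn an (n_tokens x K)
    token-prototype score matrix into soft assignments whose rows sum
    to 1 and whose columns are (approximately) balanced."""
    Q = np.exp((scores - scores.max()) / eps)  # stabilized exponential
    Q /= Q.sum()
    n, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True); Q /= K  # balance prototypes
        Q /= Q.sum(axis=1, keepdims=True); Q /= n  # balance tokens
    return Q * n  # rows sum to 1

def koleo(tokens, eps=1e-8):
    """KoLeo regularizer: negative mean log distance to each token's
    nearest neighbor. Minimizing it pushes same-image tokens apart,
    reducing redundancy."""
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    dist = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # ignore self-distances
    return -np.log(dist.min(axis=1) + eps).mean()
```

The same `sinkhorn` routine, applied to a cross-view token similarity matrix instead of token-prototype scores, yields the matching matrix P for the alignment loss.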

Implementation uses a ViT‑Base backbone with r learnable query vectors that produce the sparse tokens S. The dense patch features U are projected, compared with S via cosine similarity, and softmaxed over the r concepts with a spatial temperature to generate L, effectively a single‑head cross‑attention map.
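A minimal sketch of that cross-attention step, under the assumption that the projection of U has already been applied; the temperature value is illustrative, not the paper's:

```python
import numpy as np

def localization_matrix(U, S, tau=0.1):
    """Cosine similarity between (already projected) patch features
    U (n x d) and sparse tokens S (r x d), softmaxed over the r
    concepts with spatial temperature tau -- a single-head
    cross-attention map whose rows are convex weights over concepts."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = (Un @ Sn.T) / tau                  # n x r cosine similarities
    sim -= sim.max(axis=1, keepdims=True)    # numerical stability
    L = np.exp(sim)
    return L / L.sum(axis=1, keepdims=True)  # rows sum to 1
```

Each row of the returned L places one patch as a convex combination of the r concepts, which is exactly the form the factorization Z = L S requires.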

Empirically, with only 16 tokens the model achieves an FID of 2.60, indicating high‑quality image reconstruction, and a linear probing accuracy of 79.1% on ImageNet‑1K, matching or surpassing dense baselines. Ablation studies varying the token count show that performance saturates around 16–32 tokens, confirming the efficiency of the factorized representation. Analyses of L and S under controlled pixel shifts and random crops demonstrate that L changes equivariantly while S remains stable, validating the theoretical separation of invariance and equivariance.

The contribution is twofold: (1) a novel sparse, low‑rank latent space that disentangles “what” (semantic concepts) from “where” (spatial distribution), and (2) a self‑supervised training scheme that jointly optimizes reconstruction and semantic alignment within this space. By breaking the reliance on dense grids, STELLAR resolves the invariance paradox, offering a memory‑ and compute‑efficient backbone that can be directly applied to downstream tasks such as detection, segmentation, and multimodal vision‑language models.

