Multiview Self-Representation Learning across Heterogeneous Views
Features of the same sample produced by different pretrained models often exhibit inherently distinct distributions because of discrepancies in pretraining objectives or architectures. Learning invariant representations from large-scale unlabeled visual data with various pretrained models, in a fully unsupervised transfer manner, remains a significant challenge. In this paper, we propose a multiview self-representation learning (MSRL) method in which invariant representations are learned by exploiting the self-representation property of features across heterogeneous views. The features are derived from large-scale unlabeled visual data through transfer learning with various pretrained models and are referred to as heterogeneous multiview data. An individual linear model is stacked on top of each frozen pretrained backbone. We introduce an information-passing mechanism that relies on self-representation learning to support feature aggregation over the outputs of the linear models. Moreover, an assignment probability distribution consistency scheme is presented to guide multiview self-representation learning by exploiting complementary information across views; this scheme enforces representation invariance across the different linear models. In addition, we provide a theoretical analysis of the information-passing mechanism, the assignment probability distribution consistency scheme, and the effect of incremental views. Extensive experiments on multiple benchmark visual datasets demonstrate that the proposed MSRL method consistently outperforms several state-of-the-art approaches.
💡 Research Summary
The paper tackles the problem of learning invariant visual representations from large‑scale unlabeled data when multiple pretrained models (e.g., ResNet, ViT, Swin) generate heterogeneous feature distributions for the same image. Existing unsupervised transfer or contrastive methods assume that different views are generated by the same backbone and therefore share a common distribution; this assumption breaks down with heterogeneous backbones. To address this, the authors propose Multiview Self‑Representation Learning (MSRL), a framework that (1) freezes several pretrained backbones, (2) places a trainable linear projection on top of each backbone to obtain low‑dimensional features, (3) introduces an information‑passing mechanism that uses a self‑representation property to aggregate information from spatially neighboring features, and (4) enforces cross‑view consistency through an Assignment Probability Distribution Consistency (APDC) scheme.
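Steps (1)–(4) above can be sketched in a few lines of NumPy. Everything below is illustrative, not the paper's exact formulation: the "backbones" are fixed random matrices standing in for frozen pretrained networks, the self-representation step is simplified to a softmax affinity over the batch, and the APDC check is reduced to a KL divergence between per-view assignment distributions over hypothetical prototypes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two frozen pretrained backbones: fixed random
# maps that yield heterogeneous feature distributions for the same images.
n, d_img, d_feat, d = 8, 32, 64, 16
X = rng.normal(size=(n, d_img))                      # a mini-batch of (flattened) images
backbones = [rng.normal(size=(d_img, d_feat)),       # frozen, never updated
             rng.normal(size=(d_img, d_feat)) * 0.5 + 1.0]

# One trainable linear head per backbone maps frozen features to a shared d-dim space.
heads = [rng.normal(size=(d_feat, d)) * 0.1 for _ in backbones]

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

views = []
for B, W in zip(backbones, heads):
    H = (X @ B) @ W                                  # h_i^l = W_l phi_l(x_i)
    # Simplified self-representation: each feature re-expressed as a convex
    # combination of the others (diagonal suppressed) -- information passing.
    S = softmax(H @ H.T - 1e9 * np.eye(n), axis=1)
    views.append(S @ H)                              # aggregated per-view features

# APDC-style check (illustrative): assignment distributions per view over
# hypothetical prototypes; a KL divergence measures cross-view inconsistency.
protos = rng.normal(size=(4, d))
P = [softmax(H @ protos.T, axis=1) for H in views]
kl = np.sum(P[0] * np.log((P[0] + 1e-12) / (P[1] + 1e-12)))
```

In a real training loop, only the head weights would be updated to drive such a consistency term down, while the backbones stay frozen throughout.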
In detail, for each backbone ϕ_l the linear projection σ(·;·) with weight matrix W_l maps the frozen feature ϕ_l(x_i) to h_i^l ∈ ℝ^d. Assuming local smoothness, the set H^l =
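For concreteness, the per-view projection h_i^l = W_l ϕ_l(x_i) can be sketched as follows. The backbone output Phi, the dimensions, and the squared-norm objective are all hypothetical; the point is only that gradients flow through W_l while the backbone features stay frozen.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_feat, d = 6, 20, 8

# Hypothetical frozen backbone output: rows are phi_l(x_i) for a batch of n samples.
Phi = rng.normal(size=(n, d_feat))
W_l = rng.normal(size=(d_feat, d)) * 0.1   # trainable weights of the linear model

def project(Phi, W):
    """Rows of the result are h_i^l = W phi_l(x_i), i.e. the elements of H^l."""
    return Phi @ W

H = project(Phi, W_l)

# Toy objective (not the paper's loss): shrink the Frobenius norm of H.
# Its gradient w.r.t. W_l is 2 * Phi^T @ H; Phi itself receives no update,
# mirroring the frozen-backbone / trainable-head split.
loss_before = np.sum(H ** 2)
grad_W = 2.0 * Phi.T @ H
W_l = W_l - 0.005 * grad_W                 # one gradient step on the head only
loss_after = np.sum(project(Phi, W_l) ** 2)
```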