Group Contrastive Learning for Weakly Paired Multimodal Data
We present GROOVE, a semi-supervised multimodal representation learning approach for high-content perturbation data in which samples across modalities are weakly paired through shared perturbation labels but lack direct correspondence. Our primary contribution is GroupCLIP, a novel group-level contrastive loss that bridges CLIP, designed for paired cross-modal data, and SupCon, designed for unimodal supervised contrastive learning, thereby filling a fundamental gap in contrastive learning for weakly paired settings. We integrate GroupCLIP with an on-the-fly backtranslating autoencoder framework that encourages cross-modally entangled representations while maintaining group-level coherence within a shared latent space. Critically, we introduce a comprehensive combinatorial evaluation framework that systematically assesses representation learners in combination with multiple optimal transport aligners, addressing key limitations of existing evaluation strategies. This framework includes novel simulations that systematically vary shared versus modality-specific perturbation effects, enabling principled assessment of method robustness. Our combinatorial benchmarking reveals that no aligner yet uniformly dominates across settings or modality pairs. Across simulations and two real single-cell genetic perturbation datasets, GROOVE performs on par with or outperforms existing approaches on downstream cross-modal matching and imputation tasks. Our ablation studies demonstrate that GroupCLIP is the key component driving these gains. These results highlight the importance of leveraging group-level constraints for effective multimodal representation learning when only weak pairing is available.
💡 Research Summary
The paper introduces GROOVE, a semi‑supervised framework for learning joint representations from weakly paired multimodal single‑cell perturbation data. In this setting, samples from different modalities (e.g., imaging, RNA‑seq, protein measurements) share a common perturbation label but are not observed from the same cell, so there is no instance‑level pairing. Existing contrastive methods either require paired data (CLIP) or operate only on a single modality with class labels (SupCon), leaving a gap for weakly paired multimodal learning.
To fill this gap the authors propose GroupCLIP, a novel group‑level contrastive loss. For an anchor embedding from modality m, all embeddings from the opposite modality that carry the same perturbation label are treated as positives, while all other embeddings are negatives. The loss follows the InfoNCE formulation, normalizing over the full set of candidates from the opposite modality and using cosine similarity (or a t‑distribution kernel) with a temperature τ. Balanced under‑sampling ensures each mini‑batch contains equal numbers of samples per label, preventing class‑imbalance bias.
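The loss described above can be sketched concretely. The following is a minimal numpy sketch, not the authors' implementation; the function name, array shapes, and averaging over the positive set are our illustrative choices, and the t-distribution kernel variant is omitted in favor of plain cosine similarity.

```python
import numpy as np

def groupclip_loss(z_a, z_b, labels_a, labels_b, tau=0.1):
    """Illustrative group-level contrastive loss (hypothetical helper).

    For each anchor embedding in modality A, every embedding in modality B
    sharing the same perturbation label is a positive; all other modality-B
    embeddings are negatives. The softmax normalizes over all B candidates,
    following the InfoNCE formulation described in the text.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T / tau                        # (n_a, n_b) logits
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels_a[:, None] == labels_b[None, :]   # positive mask per anchor
    # average log-probability over each anchor's positive set
    per_anchor = -(log_p * pos).sum(axis=1) / pos.sum(axis=1)
    return per_anchor.mean()
```

In practice one would compute this symmetrically (A-to-B and B-to-A) and draw label-balanced mini-batches, as the paper's balanced under-sampling does.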
GroupCLIP is embedded within an on‑the‑fly back‑translation autoencoder architecture inspired by unsupervised neural machine translation. Each modality has its own encoder‑decoder pair, and a shared linear coupling layer projects both latent spaces into a common representation. Training proceeds in two stages per iteration: (1) a weighted sum of GroupCLIP and reconstruction loss (α·L_GroupCLIP + β·L_reconstruction) is minimized, encouraging label‑driven alignment while preserving modality‑specific information; (2) a back‑translation loss (β·L_backtranslation) is minimized, where a sample is encoded, decoded into the opposite modality, re‑encoded, and decoded back, thereby enforcing cross‑modal entanglement of the latent space.
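The back-translation cycle in stage (2) can be illustrated with toy linear maps. This is a sketch under our own assumptions (linear encoders/decoders, arbitrary toy dimensions, and the shared coupling layer folded into the encoders for brevity), not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for two modalities and the shared latent space (our choice)
d_a, d_b, d_z = 8, 6, 4
enc_a, dec_a = rng.normal(size=(d_a, d_z)), rng.normal(size=(d_z, d_a))
enc_b, dec_b = rng.normal(size=(d_b, d_z)), rng.normal(size=(d_z, d_b))

def backtranslate_a(x_a):
    """Encode A -> decode as B -> re-encode B -> decode back to A."""
    z = x_a @ enc_a          # encode modality A into the shared latent space
    x_b_hat = z @ dec_b      # translate into modality B
    z2 = x_b_hat @ enc_b     # re-encode the translated sample
    return z2 @ dec_a        # decode back to modality A

x = rng.normal(size=(5, d_a))
l_bt = np.mean((backtranslate_a(x) - x) ** 2)  # back-translation loss term
```

Minimizing `l_bt` (weighted by β, per the text) forces the two latent spaces to be mutually decodable, which is what entangles them cross-modally.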
A major contribution is the comprehensive evaluation framework. The authors generate synthetic datasets that systematically vary the proportion of shared (perturbation‑driven) versus modality‑specific signal, allowing controlled assessment of robustness. They also benchmark a suite of optimal‑transport based aligners (e.g., Gromov‑Wasserstein OT, Sinkhorn, entropic OT) in combination with each representation learner, creating a combinatorial matrix of “representation × aligner”. This reveals that no single aligner dominates across all settings, underscoring the need for strong representations.
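As one concrete instance of the aligner family benchmarked above, an entropic OT plan can be computed with Sinkhorn iterations. This is a textbook sketch with uniform marginals, not the paper's benchmarking code; `sinkhorn_plan` and its defaults are our illustrative choices.

```python
import numpy as np

def sinkhorn_plan(C, eps=0.05, n_iter=200):
    """Entropic optimal transport plan via Sinkhorn iterations.

    C is an (n, m) cost matrix between embeddings from two modalities;
    eps is the entropic regularization strength.
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)      # uniform source marginal
    b = np.full(m, 1.0 / m)      # uniform target marginal
    K = np.exp(-C / eps)         # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iter):      # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

Running each such aligner on top of each learned representation yields the "representation × aligner" matrix the paper evaluates.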
Experiments on two real single‑cell perturbation datasets—(1) gene expression paired with high‑content imaging, and (2) gene expression paired with surface protein measurements—show that GROOVE matches or exceeds state‑of‑the‑art methods for cross‑modal matching and imputation. Notably, ablation studies demonstrate that removing GroupCLIP dramatically degrades performance, confirming that the group‑level contrastive signal is the primary driver of success. Sensitivity analyses on α, β, and τ further illustrate the importance of balancing contrastive and reconstruction objectives.
In summary, GROOVE advances multimodal single‑cell integration by (i) introducing a principled group‑level supervised contrastive loss that works without instance‑level pairs, (ii) coupling this loss with a back‑translation autoencoder to produce a well‑mixed latent space, and (iii) providing a rigorous, combinatorial benchmarking pipeline. The work opens avenues for extending weakly paired learning to additional modalities (e.g., ATAC‑seq, metabolomics) and for handling noisy or incomplete label information in future studies.