SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
Simon Roschmann *1,2,3,4, Paul Krzakala *5,6, Sonia Mazelet 6, Quentin Bouniot 1,2,3,4, Zeynep Akata 1,2,3,4

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image–text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

1. Introduction

Vision-language models (VLMs) learn a shared embedding space for images and text, enabling zero-shot transfer to unseen concepts and domains. Since CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), the dominant paradigm has relied on large-scale contrastive training on paired image-text data, with performance improving predictably as supervision scales.
* Equal contribution. 1 Helmholtz Munich, 2 Technical University of Munich, 3 Munich Center for Machine Learning, 4 Munich Data Science Institute, 5 Télécom Paris, 6 École Polytechnique. Correspondence to: Simon Roschmann <simon.roschmann@tum.de>. Preprint. February 27, 2026.

Figure 1. Semi-Supervised Vision-Language Alignment. We tackle the alignment of frozen unimodal encoders where paired data (red blocks) is scarce but unpaired data is abundant. The key challenge is: how to define a training signal for unpaired data when ground-truth cross-modal correspondences are missing?

While effective, this approach requires hundreds of millions of paired samples (Cherti et al., 2023), making VLMs costly to train and difficult to adapt when paired data are limited. This issue arises in many critical applications, such as specialized scientific, medical, or industrial domains, where collecting large-scale annotations is expensive, time-consuming, or infeasible.

In this paper, we investigate vision–language alignment beyond large-scale supervision, asking whether meaningful alignment can be achieved from pretrained encoders using only a small number of paired samples together with abundant unimodal data. We posit that, under the Platonic Representation Hypothesis (Huh et al., 2024), unimodal models should already encode compatible semantic structures, making such alignment possible with minimal supervision.

SOTAlign Overview. We focus on training lightweight alignment layers on top of pretrained unimodal encoders.
In this setting, we first demonstrate that meaningful cross-modal alignment can be recovered from very few paired samples using simple linear methods, providing empirical support for the Platonic Representation Hypothesis. Then, we introduce SOTAlign (Semi-supervised Optimal Transport-based Alignment), a simple approach that further refines this alignment by leveraging large unimodal datasets, achieving state-of-the-art results in this semi-supervised setting. SOTAlign relies on KLOT, an optimal-transport-based divergence that transfers the initial geometric structure of the linear teacher while allowing sufficient flexibility to avoid underfitting. Critically, we derive the explicit gradient of the KLOT divergence, removing the memory bottlenecks that have limited the scalability of previous optimal-transport-based alignment methods. In the experimental section, we implement strong supervised and semi-supervised baselines and demonstrate the superior performance of SOTAlign across a wide range of downstream tasks. We carefully explore the robustness of SOTAlign to a variety of factors such as the number of paired samples, the number of unimodal samples, and the pretrained unimodal models. Notably, we also show that SOTAlign can leverage samples from multiple sources simultaneously, for example combining unimodal images from ImageNet and captions from CC12M to improve performance on COCO despite a significant distribution shift.

In short, we make the following contributions:

• We show that meaningful vision–language alignment can be recovered from very few paired samples using simple linear methods.

• We introduce SOTAlign, a semi-supervised approach that leverages unpaired unimodal data to achieve state-of-the-art alignment.
• We propose KLOT, a novel optimal-transport-based divergence, and fully address the memory bottlenecks that plagued prior OT-based methods.

• We validate the robustness of SOTAlign through extensive experiments across tasks, datasets, and encoders.

2. Related Work

Vision–Language Models. VLMs learn a joint embedding space for images and text, enabling zero-shot transfer across downstream tasks. CLIP (Radford et al., 2021) established this paradigm through large-scale contrastive pretraining on 400 million image–text pairs, demonstrating that strong alignment can emerge from frozen pretrained encoders. Subsequent work has focused on scaling data and refining contrastive objectives to improve performance. ALIGN (Jia et al., 2021) leveraged noisy web-scale supervision, while SigLIP (Zhai et al., 2023) and SigLIPv2 (Tschannen et al., 2025) introduced alternative losses and massive multilingual datasets, with SigLIPv2 training on WebLI, comprising 10 billion images and 12 billion alt-texts across 109 languages. These efforts are consistent with empirical scaling laws observed for CLIP-style models (Cherti et al., 2023), but also highlight a central limitation: achieving state-of-the-art performance requires millions or billions of paired samples, which is impractical in many settings and modalities.

More recently, OT-CLIP (Shi et al., 2024) proposed an Optimal Transport (OT) interpretation of the InfoNCE objective (Oord et al., 2018), viewing contrastive learning as inverse OT with a fixed identity transport plan. We adopt this perspective in the present work, but extend it beyond fully supervised settings by allowing target transport plans that are not restricted to the identity. Moreover, we derive an explicit expression for the gradient of the resulting objective (Theorem 5.1), removing the memory bottlenecks that have limited OT-based approaches to small batch sizes.
The Platonic Representation Hypothesis. Huh et al. (2024) posit that neural networks trained on different modalities, architectures, or objectives tend to converge toward compatible latent representations that reflect shared underlying structure in the data. In the context of vision-language models, this perspective suggests that pretrained unimodal image and text encoders may already produce semantically aligned representations, even in the absence of explicit cross-modal training. This observation motivates an alternative approach to VLM construction, in which the pretrained encoders are kept frozen and only lightweight alignment layers are learned to reconcile their representation spaces. Several recent works adopt this paradigm, demonstrating that strong vision–language performance can be achieved by aligning frozen pretrained unimodal encoders rather than training multimodal models from scratch (Vouitsis et al., 2024; Zhang et al., 2025a; Maniparambil et al., 2025; Huang et al., 2025). Our work follows this line of research, but focuses on regimes where paired supervision is severely limited.

Low-Supervision Alignment. A growing body of work has explored alignment under weak, limited, or absent supervision. In unimodal settings, Jha et al. (2025) show that text embeddings can be aligned across representation spaces without paired data. Extending this idea to cross-modal alignment, Maniparambil et al. (2024) and Schnaus et al. (2025) demonstrate that vision–language representations can also be matched without supervision, but rely on quadratic assignment problem solvers that scale only to a few hundred samples, limiting applicability.

Closer to our setting, S-CLIP (Mo et al., 2023) introduces a semi-supervised framework in which optimal transport defines target similarities between unpaired images and paired captions, with promising results for domain adaptation of CLIP.
In contrast, we define target similarities even between unpaired images and unpaired captions, enabling effective use of large-scale unimodal data on both sides (Figure 1). SUE (Yacobi et al., 2025) also considers semi-supervised vision–language alignment, but is limited to a single dataset and a single downstream task. Our work generalizes this setting across tasks, datasets, and encoder combinations. Finally, STRUCTURE (Gröger et al., 2025) augments InfoNCE with a regularization term encouraging preservation of unimodal geometry. While evaluated in supervised settings, this idea could in principle leverage unpaired data and is therefore included as a baseline in our experiments.

3. Methodology

Notations. For $u, v \in \mathbb{R}^d$, we denote the cosine similarity by $k(u, v) = \frac{\langle u, v \rangle}{\|u\| \|v\|}$. We stack batches of $n$ vectors as matrices in $\mathbb{R}^{n \times d}$. Given $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{m \times d}$, we define the affinity matrix $K[U, V] \in \mathbb{R}^{n \times m}$ with entries $K[U, V]_{i,j} = k(U_i, V_j)$. We define the (row-wise) Softmax normalization as

$$\mathrm{Softmax}_\varepsilon(K)_{i,j} = \frac{\exp(K_{i,j}/\varepsilon)}{\sum_{k=1}^{n} \exp(K_{i,k}/\varepsilon)}.$$

3.1. Problem Formulation

We denote $d_x$ (resp. $d_y$) the latent dimension of the pretrained vision (resp. language) encoder. We consider the problem of learning alignment layers $f_{\theta_1}: \mathbb{R}^{d_x} \to \mathbb{R}^d$ and $g_{\theta_2}: \mathbb{R}^{d_y} \to \mathbb{R}^d$ that encode vision and language into a shared space of dimension $d$, parametrized by $\theta = (\theta_1, \theta_2)$. In this setting, the training objective is often formulated as minimizing the divergence between the geometry of the shared space and a target geometry.
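As a concrete illustration of these notations, the affinity matrix and the temperature-scaled row-wise Softmax can be computed in a few lines. This is a minimal NumPy sketch; the function names are ours, not from the paper:

```python
import numpy as np

def affinity(U, V):
    """Affinity matrix K[U, V] with entries k(U_i, V_j) (cosine similarity)."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return Un @ Vn.T

def softmax_eps(K, eps):
    """Row-wise Softmax_eps: normalizes exp(K_ij / eps) over each row."""
    Z = np.exp((K - K.max(axis=1, keepdims=True)) / eps)  # shift for stability
    return Z / Z.sum(axis=1, keepdims=True)
```

Each row of `softmax_eps(K, eps)` sums to one, and smaller `eps` concentrates the mass on the nearest neighbor, which is the limit used below to recover InfoNCE.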
Formally, given a dataset of image and language embeddings $X \in \mathbb{R}^{n \times d_x}$ and $Y \in \mathbb{R}^{n \times d_y}$, the goal is to minimize

$$\mathcal{L}(\theta; X, Y) = \mathrm{DIV}\big(K[f_{\theta_1}(X), g_{\theta_2}(Y)] \,\|\, K^*[X, Y]\big), \quad (1)$$

where $K^*$ denotes the "target geometry" and DIV is some divergence between two affinity matrices. In the fully supervised setting, the dataset is made of pairs, i.e., $Y_i$ is the caption of $X_i$, and the target similarity is set to the identity:

$$K^*[X, Y] = I_n. \quad (2)$$

For instance, the InfoNCE loss (Oord et al., 2018)

$$-\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp K[f(X), g(Y)]_{i,i}}{\sum_{j=1}^{n} \exp K[f(X), g(Y)]_{i,j}}, \quad (3)$$

is a special case of (1) for $K^* = I_n$ and $\mathrm{DIV}(K \| K^*) = \lim_{\varepsilon \to 0} \mathrm{KL}(\mathrm{Softmax}_\varepsilon(K^*) \,\|\, \mathrm{Softmax}_1(K))$. Thus, the main challenge in extending these approaches to unsupervised data is to introduce a target $K^*[X, Y]$ that is defined even if we don't assume that $X$ and $Y$ are pairs.

3.2. Semi-Supervised Setting

We consider a semi-supervised setting with three types of data. First, we observe a small set of paired samples $(A, B)$, where $A \in \mathbb{R}^{n_p \times d_x}$ and $B \in \mathbb{R}^{n_p \times d_y}$, and each row of $A$ is aligned with the corresponding row of $B$. In addition, we have access to large collections of unpaired data: unlabeled images $X \in \mathbb{R}^{n_x \times d_x}$ and unlabeled text $Y \in \mathbb{R}^{n_y \times d_y}$. Finally, we assume that the number of supervised pairs is limited, i.e., $n_p \ll n_x$ and $n_p \ll n_y$.

Algorithm 1: SOTAlign Training
Require: $(A, B)$, $X$, $Y$
1: $(W_x, W_y) \leftarrow \mathrm{LinearAlignment}(A, B)$
2: Initialize encoders $f$ and $g$
3: for $i = 1, \dots, T$ do
4:   Sample $X_b \sim X$  # sample batch of image embeddings
5:   Sample $Y_b \sim Y$  # sample batch of text embeddings
6:   $K^* \leftarrow \mathrm{cosine}(X_b W_x^\top, Y_b W_y^\top)$
7:   $K \leftarrow \mathrm{cosine}(f(X_b), g(Y_b))$
8:   $K_p \leftarrow \mathrm{cosine}(f(A), g(B))$
9:   $\mathcal{L} \leftarrow \mathrm{SigLIP}(K_p, I_{n_p}) + \alpha\, \mathrm{KLOT}(K, K^*)$
10:  Update $f$, $g$ using $\nabla\mathcal{L}$
11: end for

This setting is motivated by two considerations.
First, it allows us to study how far supervision can be reduced while still enabling the recovery of a meaningful alignment between modalities, providing a direct test of the Platonic Representation Hypothesis. Second, such a regime reflects many practical scenarios in multimodal learning, where collecting paired data is expensive or infeasible and only a small number of aligned samples is available.

3.3. SOTAlign

We address this setting with a two-step approach. First, we fit a simple linear alignment model on the supervised pairs $(A, B)$. We denote $W_x \in \mathbb{R}^{d_x \times d'}$ and $W_y \in \mathbb{R}^{d_y \times d'}$ the linear projections produced by this model. An important finding of this work is that such linear models already yield surprisingly strong alignment. We dedicate Section 4 to this fundamental component of our method.

Then, we use this linear model to regularize the training of the final alignment layers $f_{\theta_1}$ and $g_{\theta_2}$ in a pseudo-labeling fashion. More precisely, we constrain the geometry of the learned shared space to stay close to that produced by the linear teacher. This regularization writes as

$$\Omega(\theta; X, Y) = \mathrm{DIV}\big(K[f_{\theta_1}(X), g_{\theta_2}(Y)] \,\|\, K[X W_x^\top, Y W_y^\top]\big), \quad (4)$$

where the choice of the divergence DIV is the second core component of the method and is discussed in Section 5. Finally, our training loss is

$$\mathcal{L}_\alpha(\theta; A, B, X, Y) = \mathcal{L}(\theta; A, B) + \alpha\, \Omega(\theta; X, Y), \quad (5)$$

where $\alpha$ controls the strength of the regularization.
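Putting Sections 3.1–3.3 together, one SOTAlign training step evaluates the loss of Eq. (5) from a paired batch and two unpaired batches (cf. Algorithm 1). The sketch below is illustrative only: to stay self-contained it substitutes a row-wise softmax KL for both the SigLIP term and the KLOT regularizer of the paper, and all names and shapes are ours:

```python
import numpy as np

def cosine(U, V):
    """Pairwise cosine-similarity (affinity) matrix."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return Un @ Vn.T

def softmax_rows(K, eps):
    Z = np.exp((K - K.max(axis=1, keepdims=True)) / eps)
    return Z / Z.sum(axis=1, keepdims=True)

def kl_rows(P, Q):
    """Mean row-wise KL(P || Q) for row-stochastic matrices."""
    return float(np.mean(np.sum(P * (np.log(P + 1e-12) - np.log(Q + 1e-12)), axis=1)))

def sotalign_loss(f, g, A, B, Xb, Yb, Wx, Wy, alpha, eps_star=0.1):
    """One evaluation of Eq. (5). DIV stands in for the SigLIP / KLOT
    objectives used in the paper (here: row-wise softmax KL)."""
    # Supervised term on the paired batch: affinity vs. identity target, Eq. (2)
    Kp = cosine(f(A), g(B))
    sup = kl_rows(np.eye(len(A)), softmax_rows(Kp, 1.0))
    # Teacher regularization on unpaired batches, Eq. (4)
    K_star = cosine(Xb @ Wx, Yb @ Wy)   # linear-teacher geometry
    K = cosine(f(Xb), g(Yb))            # learned shared-space geometry
    reg = kl_rows(softmax_rows(K_star, eps_star), softmax_rows(K, 1.0))
    return sup + alpha * reg
```

The key design point this sketch highlights is that the unpaired batches never need cross-modal correspondences: the teacher's affinity matrix `K_star` plays the role of the missing supervision.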
Figure 2. SOTAlign is a two-step method for the alignment of pretrained unimodal image and text encoders. First, we fit a linear alignment model (CCA, Procrustes, or contrastive) using only the limited amount of available image-text pairs. Then, we use this linear model as a teacher to regularize the training of alignment layers f and g for a joint embedding space leveraging unimodal (unpaired) data.

4. Linear Alignment Model

The first core component of the proposed method is the linear alignment model. This model is trained with the limited amount of pairs available and then used as a teacher to regularize the training of the full semi-supervised model. As highlighted in the experimental section, such linear models achieve surprisingly strong alignment performance. We now discuss a selection of suitable candidates.

Procrustes Alignment. In the Orthogonal Procrustes problem, one tries to align two point clouds by looking for the orthogonal transformation that minimizes the RMSE (Schönemann, 1966). We slightly adapt its formulation to our setting by looking for two orthogonal transformations that map the pairs $(A, B)$ to a shared space. Formally,

$$(W_x, W_y) = \arg\max_{P, Q} \langle A P^\top, B Q^\top \rangle \quad \text{s.t.} \quad P P^\top = Q Q^\top = I_{d'}. \quad (6)$$

This formulation assumes that the data is first centered and normalized, which is omitted here for the sake of simplicity.

Canonical Correlation Analysis. In statistics, CCA is used to find a space in which two random variables are maximally correlated with each other (Mardia et al., 2024).
Transposing this to our setting, the two random variables are the text and image embeddings, and CCA writes as

$$(W_x, W_y) = \arg\max_{P, Q} \langle A P^\top, B Q^\top \rangle \quad \text{s.t.} \quad (A P^\top)^\top (A P^\top) = (B Q^\top)^\top (B Q^\top) = I_{d'}. \quad (7)$$

The main difference from Procrustes is that the orthogonality constraint is applied to the shared-space directions instead of the transformation itself. The solutions to Equations (6) and (7) are provided in Appendix C.1.

Contrastive Learning. Finally, perhaps the most natural choice is to consider a linear projection trained with a classical contrastive learning approach, formally

$$(W_x, W_y) = \arg\min_{P, Q} \mathrm{DIV}\big(K[A P^\top, B Q^\top] \,\|\, I_{n_p}\big), \quad (8)$$

where DIV is the InfoNCE loss defined in Equation (3) or an alternative such as the SigLIP loss (Zhai et al., 2023).

5. Choice of Divergence

The second core component of our method is the choice of a divergence $\mathrm{DIV}(K \| K^*)$, where $K, K^* \in \mathbb{R}^{n \times n}$ denote, respectively, the affinity matrix induced by the trainable shared representation and the target affinity matrix. This section examines several possible choices for this divergence and discusses their respective strengths and limitations.

Centered Kernel Alignment. One of the most popular ways to compare affinity/kernel matrices is Centered Kernel Alignment (CKA) (Cristianini et al., 2001). Denoting $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ the centering matrix, CKA writes as

$$\mathrm{CKA}(K, K^*) = \frac{\langle K H, H K^* \rangle}{\sqrt{\langle K H, H K \rangle \langle K^* H, H K^* \rangle}}. \quad (9)$$

CKA admits a linear-time computation in the batch size $n$ when implemented via kernel factorizations (Proposition C.6), which constitutes a non-negligible practical property for alignment methods operating with large batches (Zhang et al., 2025a). However, CKA also suffers from known limitations (Davari et al., 2022) and, more importantly in our setting, enforces a strong constraint of the form $K \approx K^*$.
This can be overly restrictive when $K^*$ is only intended as a regularizing signal provided by a linear teacher, rather than an exact target geometry.

Generalized InfoNCE. As highlighted above, it might be beneficial to use a regularization that does not enforce $K$ to be exactly aligned with $K^*$. To this end, one can consider the generalized InfoNCE loss (Shi et al., 2024), as it only enforces that $\arg\max_j K_{i,j} \approx \arg\max_j K^*_{i,j}$. This is achieved by first applying a Softmax on the affinity matrices before comparing them, i.e.,

$$\mathrm{InfoNCE}(K \| K^*) = \mathrm{KL}(\mathrm{Softmax}_{\varepsilon^*}(K^*) \,\|\, \mathrm{Softmax}_{\varepsilon}(K)). \quad (10)$$

The classical version is recovered for $\varepsilon^* \to 0$ and $K^* = I_n$.

KLOT. Given the previous interpretation of InfoNCE, we consider a natural extension that seeks the preservation of the entire Optimal Transport (OT) plan instead of only the nearest neighbor. Introducing $\Pi_n = \{P \in \mathbb{R}^{n \times n}_+ \mid P\mathbf{1} = \mathbf{1},\ P^\top\mathbf{1} = \mathbf{1}\}$, the OT plan is defined as

$$\mathrm{OT}_\varepsilon(K) = \arg\min_{P \in \Pi_n} -\langle P, K \rangle + \varepsilon H(P), \quad (11)$$

where $H(P) = \langle P, \log P \rangle$ is the negative entropy. Note that this is a natural extension of Softmax, as $\mathrm{Softmax}_\varepsilon(K) = \arg\min_{P\mathbf{1} = \mathbf{1}} -\langle P, K \rangle + \varepsilon H(P)$ and, similarly to Softmax, setting $\varepsilon = 0$ recovers a strict one-to-one mapping. More details about OT are available in Appendix C.2. Then, we define the KLOT divergence as

$$\mathrm{KLOT}(K \| K^*) = \mathrm{KL}(\mathrm{OT}_{\varepsilon^*}(K^*) \,\|\, \mathrm{OT}_\varepsilon(K)). \quad (12)$$

This formulation is similar to that proposed by Van Assel et al. (2023) for the purpose of dimensionality reduction and generalizes (Shi et al., 2024) beyond $K^* = I_n$.

The main limitation of KLOT (12) is that $\mathrm{OT}_\varepsilon(K)$ does not admit a closed-form solution. While the Sinkhorn algorithm (Cuturi, 2013) provides fast convergence on GPUs, computing the gradient $\nabla_K \mathrm{OT}_\varepsilon(K)$ remains challenging. Existing approaches rely either on backpropagating through the Sinkhorn iterations by unrolling the algorithm (Genevay et al., 2018), which induces a severe memory bottleneck, or on implicit differentiation techniques (Eisenberger et al., 2022), which significantly increase time complexity. We fully address this limitation by deriving an explicit expression for the gradient (Theorem 5.1).

Figure 3. GPU memory usage for a batch size n = 10k when computing the gradient of the OT-based divergence with naive solver unrolling (blue) and the provided explicit gradient formula (orange). Additional results are reported in Appendix B.1.

Theorem 5.1. For any $K, K^* \in \mathbb{R}^{n \times n}$ with transport plans in $\Pi_n$,

$$\nabla_K\, \mathrm{KLOT}(K \,\|\, K^*) = \frac{\mathrm{OT}_\varepsilon(K) - \mathrm{OT}_{\varepsilon^*}(K^*)}{\varepsilon}. \quad (13)$$

Proof is provided in Appendix C.2.

As illustrated in Figure 3, our approach removes the memory bottleneck inherent to Sinkhorn unrolling and can be up to 50× faster than implicit differentiation (Figure 6). This is a general result that could potentially apply to a range of OT-based methods using similar objectives, including recent approaches in model alignment and contrastive learning (Van Assel et al., 2023; Mo et al., 2023; Shi et al., 2024).

6. Experiments

Experimental Setting. We train all models using a maximum batch size of 32k, composed of up to 10k paired samples and completed with unpaired images and text. We use the LION optimizer (Chen et al., 2023) with a cosine annealing learning-rate schedule, a maximum learning rate of $10^{-4}$, and a weight decay of $10^{-5}$, and train for 2000 iterations. For the supervised component of the loss, we employ the SigLIP objective, initializing the logit scale to 20 and the logit bias to −10, with both parameters learned during training. Unless otherwise specified, we use DINOv3 ViT-L (Siméoni et al., 2025) and NV-Embed-v2 (Lee et al., 2025) as the pretrained vision and language encoders, respectively.
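Returning to the divergence of Section 5: Theorem 5.1 makes KLOT practical at scale, since the gradient only requires the two transport plans themselves and no differentiation through the Sinkhorn solver. The following is a minimal NumPy sketch (log-domain Sinkhorn with all marginals equal to one, as in the definition of $\Pi_n$; function names are ours):

```python
import numpy as np

def _lse(A, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = A.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(A - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_plan(K, eps, n_iter=1000):
    """Entropic OT plan OT_eps(K) of Eq. (11): argmin over Pi_n of
    -<P, K> + eps <P, log P>, via log-domain Sinkhorn (Cuturi, 2013)."""
    n = K.shape[0]
    logG = K / eps                 # log Gibbs kernel
    u = np.zeros(n)
    v = np.zeros(n)
    for _ in range(n_iter):
        u = -_lse(logG + v[None, :], axis=1)   # enforce row sums = 1
        v = -_lse(logG + u[:, None], axis=0)   # enforce column sums = 1
    return np.exp(logG + u[:, None] + v[None, :])

def klot(K, K_star, eps, eps_star):
    """KLOT(K || K*) = KL(OT_{eps*}(K*) || OT_eps(K)), Eq. (12)."""
    P_star = sinkhorn_plan(K_star, eps_star)
    P = sinkhorn_plan(K, eps)
    return float(np.sum(P_star * (np.log(P_star) - np.log(P))))

def klot_grad(K, K_star, eps, eps_star):
    """Explicit gradient from Theorem 5.1: no unrolling of Sinkhorn, so
    memory stays at the cost of storing two n x n plans."""
    return (sinkhorn_plan(K, eps) - sinkhorn_plan(K_star, eps_star)) / eps
```

A quick finite-difference check of `klot_grad` against `klot` reproduces the closed form of Eq. (13) on small random inputs.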
By default, experiments are conducted with 10k paired samples and, when applicable, up to 1M unpaired images and texts drawn from CC3M (Sharma et al., 2018). We vary the weight of the regularization in Equation (5) over $\alpha \in \{10^{-3}, 10^{-4}, 10^{-5}\}$ and select it based on retrieval performance on CC3M, accounting for different ratios of supervised to unsupervised data. We show in Section 6.2 that comparable results can be obtained with alternative settings. By default, the metric that we report is the average of the text-to-image (T2I) and image-to-text (I2T) retrieval (Recall@1) performance on the COCO validation set, which we denote MeanR@1.

Table 1. Comparison of linear methods across different divergences. Zero-shot retrieval on COCO (MeanR@1). "None" indicates the standalone performance of the linear method.

Divergence \ Linear Method | Procrustes | CCA  | Contrastive
None                       | 21.1       | 21.5 | 24.2
CKA                        | 23.5       | 23.5 | 24.2
InfoNCE                    | 23.9       | 24.1 | 26.5
KLOT                       | 30.0       | 30.3 | 28.5

6.1. Ablation Studies

Linear Methods. SOTAlign employs a linear teacher which can be fit using the methods described in Section 4. We report the standalone zero-shot retrieval performance of these linear models when trained on only 10k image-text pairs from CC3M in the first row of Table 1. Notably, even the closed-form models (CCA and Procrustes) reach MeanR@1 scores larger than 21%. The linear contrastive approach (SigLIP loss, SAIL baseline) reaches 24.2% and will serve as the main supervised baseline.

Divergences. To align the teacher's affinity matrix with the affinity matrix in the learnable shared space, SOTAlign can utilize any of the three divergences detailed in Section 5. Table 1 displays all combinations of linear methods and divergences. Critically, we observe that the proposed OT-based divergence KLOT systematically outperforms the classical alternatives.
The best results are achieved for CCA combined with KLOT, and we select this setting for the subsequent experiments.

6.2. Robustness of SOTAlign

Number of Supervised Pairs. We first analyze how the number of paired image–text samples affects downstream performance. In this experiment, we fix the amount of unpaired data to 1M images and 1M text samples from CC3M, and vary the number of paired examples from $10^2$ to $10^5$. Figure 4 reports zero-shot retrieval results. Across all supervision levels, SOTAlign consistently outperforms the supervised SAIL baseline, with gains of up to +10% accuracy in the intermediate regime between $10^3$ and $10^4$ pairs. As expected, these gains diminish as the number of paired samples increases, and both methods fail under extremely sparse supervision (100 pairs). Overall, SOTAlign reaches the same performance as SAIL with roughly 4 times less supervision.

Number of Unsupervised Samples. Next, we investigate the effect of unpaired data on downstream performance. In this experiment, we fix the number of pairs to 10k, and vary the number of additional unpaired image and text samples between $10^4$ and $10^6$. Figure 4 shows that our method successfully leverages unpaired data for zero-shot retrieval. We observe consistent gains from the unpaired data up to 500k unpaired samples.

Unsupervised Data Source. We further evaluate our method in a challenging cross-dataset regime where unpaired images and texts originate from entirely different sources (Table 7). Using a fixed set of 10k paired samples from CC3M, we introduce unpaired unimodal data from CC12M, COCO, and ImageNet-1k, as well as synthetic captions. Despite these shifts in the data distributions, our approach consistently outperforms the supervised baseline. Notably, incorporating ImageNet-1k images improves classification performance, while leveraging COCO samples yields retrieval gains by narrowing the gap to the test distribution.
These results demonstrate that our framework can effectively exploit unpaired data even when the visual and textual modalities are drawn from disjoint, heterogeneous corpora.

Quantifying the Distribution Shift. Motivated by these results, we next seek to quantify the effect of distribution shift and relate it to the observed performance gains. Given a source of unpaired data $D = (X, Y)$ and the paired dataset $D_p = (A, B)$, we define the distribution shift as

$$d(D, D_p) = \mathrm{SSW}(X, A) + \mathrm{SSW}(Y, B), \quad (14)$$

where SSW denotes the Spherical Sliced Wasserstein distance (Liu et al., 2024), with details in Appendix C.2. We adopt this metric because it is scalable to large datasets, well suited to unit-normalized embeddings, and does not require $X$ and $A$ (or $Y$ and $B$) to be aligned. As shown in Figure 5, this distance strongly correlates with downstream performance: unpaired data that are closer to the paired distribution consistently yield larger performance gains when incorporated during training.

Supervised Data Source. We next study the impact of the paired data source on SOTAlign performance (Table 2), while fixing the unpaired data to 1M samples from CC3M. Varying the source of the 10k image–text pairs reveals that higher-quality supervision can substantially influence alignment. In particular, using CC3M pairs with synthetic captions yields a notable improvement in retrieval performance (+4.8% T2I R@1), suggesting that cleaner textual supervision better guides the exploitation of noisy unpaired data. While pairs drawn from the larger CC12M corpus improve ImageNet classification, the strongest retrieval performance is obtained when using COCO pairs.

Unimodal Encoders. In the same vein, we examine the impact of the choice of unimodal encoders on zero-shot classification and retrieval performance.
We fix the training data to the standard setting of 10k paired samples and 1M unpaired samples from CC3M, and vary only the vision and language encoders. As reported in Table 3, aligning DINOv3 ViT-L with NV-Embed-v2 yields the strongest downstream performance, achieving 46.1% accuracy on ImageNet and 26.5% T2I R@1 on COCO.

Figure 4. Left: Effect of the number of paired samples (while fixing 1M unpaired samples). Right: Effect of the number of unpaired samples (while fixing 10k pairs). We report the zero-shot retrieval (MeanR@1) on COCO. More metrics are reported in Appendix B.

Figure 5. Relationship between the total sliced Wasserstein distance between the CC3M image/text dataset and unimodal datasets, and the downstream performance of SOTAlign trained on 10k CC3M image–text pairs and up to 1M samples from the corresponding unimodal datasets. The plot reports Pearson r = −0.72.

Among the evaluated vision models, DINOv3 consistently outperforms earlier variants, which we attribute to its substantially larger pretraining corpus of 1.7 billion images, compared to 142 million for DINOv2. This trend aligns with the Platonic Representation Hypothesis (Huh et al., 2024), which suggests that as models scale in data and capacity, their representation spaces increasingly converge.
We hypothesize that this intrinsic convergence reduces the alignment gap between modalities, thereby facilitating semi-supervised alignment with SOTAlign. In Appendix B, we reveal a strong positive correlation between representational similarity and downstream MeanR@1 (Pearson r = 0.83).

6.3. Benchmarking Semi-Supervised Alignment

Baselines. Since the semi-supervised alignment setting we consider is relatively unexplored, there are no established standard baselines. We therefore compare SOTAlign against a range of supervised and semi-supervised methods which we adapt to our setting. The primary supervised baseline is SAIL (Zhang et al., 2025a), trained on paired image-text data with a SigLIP loss; we also propose a semi-supervised variant that incorporates unpaired samples as additional negatives. We further consider STRUCTURE (Gröger et al., 2025), which regularizes joint embeddings to preserve unimodal geometry, and evaluate this term using either paired data only or both paired and unpaired samples. In addition, we include pseudo-labeling approaches that construct synthetic pairs from similarity distributions, including NNCLR (Dwibedi et al., 2021) (as used in DeCLIP (Li et al., 2022)) and S-CLIP (Mo et al., 2023). Finally, we compare against SUE (Yacobi et al., 2025), a semi-supervised alignment method restricted to retrieval on a single dataset. Full details of all baselines are provided in Appendix A.2.

Table 2. SOTAlign compared to supervised SAIL across different paired datasets (10k pairs) and 1M unpaired samples from CC3M. Zero-shot retrieval on COCO (MeanR@1). Values in parentheses (+) represent the absolute gain.

  Paired data      Method     ImageNet 1K    COCO T2I      COCO I2T
  CC3M             SAIL       35.6           21.0          27.4
                   SOTAlign   46.1 (+10.5)   26.5 (+5.5)   34.1 (+6.7)
  CC3M synthetic   SAIL       36.2           28.3          37.2
                   SOTAlign   46.5 (+10.3)   31.3 (+3.0)   43.1 (+5.9)
  CC12M            SAIL       38.5           20.1          27.2
                   SOTAlign   47.4 (+8.9)    26.1 (+6.0)   36.3 (+9.1)
  COCO             SAIL       21.8           30.7          42.4
                   SOTAlign   35.8 (+14.0)   34.8 (+4.1)   46.7 (+4.3)

Table 3. SOTAlign trained on 10k pairs and 1M unpaired samples from CC3M with different unimodal encoders. Zero-shot classification (top-1 acc) and retrieval (R@1). Values in parentheses represent the absolute gain over supervised SAIL.

  Vision Model   Language Model   ImageNet 1K    COCO T2I      COCO I2T
  DINOv2         Nemotron-8B      32.4 (+6.9)    15.5 (+3.8)   23.3 (+5.3)
  DINOv2         Qwen3-8B         39.5 (+7.7)    20.9 (+4.1)   31.1 (+7.3)
  DINOv2         NV-Embed-v2      42.5 (+9.8)    23.1 (+4.1)   31.1 (+7.7)
  DINOv3         Nemotron-8B      35.5 (+10.1)   16.6 (+3.8)   26.2 (+5.2)
  DINOv3         Qwen3-8B         42.7 (+9.3)    24.1 (+5.0)   35.3 (+7.3)
  DINOv3         NV-Embed-v2      46.1 (+10.5)   26.5 (+5.5)   34.1 (+6.7)

Table 4. Zero-shot image-text retrieval (Recall@1) on COCO and Flickr30k. Comparison of supervised and semi-supervised methods trained with 10k image-text pairs and 1M unpaired samples from CC3M. Upper bound with 1M supervised pairs listed first.

                             COCO            Flickr30k
  Method                     T2I    I2T      T2I    I2T
  SAIL (1M, upper bound)     35.5   45.5     63.1   75.0
  Supervised
    SAIL                     21.0   27.4     45.7   54.1
    STRUCTURE                21.0   28.7     46.8   54.0
  Semi-supervised
    SAIL                     20.7   26.5     44.9   53.1
    STRUCTURE                20.9   28.0     45.7   56.0
    NNCLR                    21.3   27.9     46.6   53.0
    S-CLIP                   20.4   27.8     44.5   52.6
    SOTAlign (Ours)          26.5   34.1     51.7   60.8

Table 5. Zero-shot image classification (top-1 accuracy). Comparison of supervised and semi-supervised methods trained with 10k image-text pairs and 1M unpaired samples from CC3M. Upper bound with 1M supervised pairs listed first.

  Method                    Food-101   CIFAR-10   CIFAR-100   DTD    ImageNet
  SAIL (1M, upper bound)    63.9       97.8       82.3        53.5   56.4
  Supervised
    SAIL                    36.4       96.2       71.2        36.8   35.6
    STRUCTURE               38.5       96.7       72.2        39.5   38.2
  Semi-supervised
    SAIL                    36.5       96.2       71.3        35.9   35.6
    STRUCTURE               37.6       96.5       70.8        38.7   36.8
    NNCLR                   37.9       96.5       73.0        38.8   37.4
    S-CLIP                  35.3       95.9       69.3        37.6   36.4
    SOTAlign (Ours)         50.0       97.5       78.3        42.4   46.1

Zero-Shot Image-Text Retrieval. We evaluate SOTAlign against these baselines in T2I and I2T retrieval on COCO (Lin et al., 2014) and Flickr30k (Plummer et al.
, 2015), and report the results in Table 4. In the low-resource regime with 10k image-text pairs, the supervised baseline SAIL reaches 21.0 T2I R@1 and 27.4 I2T R@1. STRUCTURE performs marginally better, benefiting from its structure-preservation objective. However, both methods fail to exploit unpaired data, either as additional negatives or for structure preservation. Notably, even the adapted semi-supervised approaches, NNCLR and S-CLIP, are unable to successfully exploit unpaired data. S-CLIP was originally developed for domain adaptation and appears less robust when confronted with the large diversity of unpaired samples in our setting. Its pseudo-labels are further limited to the small set of paired instances. In contrast, SOTAlign successfully leverages the 1M unpaired images and texts from CC3M to improve cross-modal alignment. On Flickr30k, our method reaches 51.7 T2I R@1 and 60.8 I2T R@1, yielding gains of +4.9 and +4.8 over the strongest baselines, respectively. For comparison, we include the supervised upper bound obtained by training SAIL with 1M paired examples.

Table 6. Alignment per dataset. Following the setup of SUE (Yacobi et al., 2025), we train for alignment per dataset and evaluate image-text retrieval (Recall@5).

                     COCO (100 pairs)   Flickr30k (500 pairs)   Polyvore (500 pairs)
  Method             I2T     T2I        I2T     T2I             I2T     T2I
  CSA                1.3     1.0        1.3     0.8             1.3     1.0
  Contrastive        8.5     5.8        9.5     9.8             13.8    11.5
  SUE                21.5    18.3       19.8    22.0            22.8    20.8
  SOTAlign (Ours)    35.8    35.0       59.8    63.3            55.3    55.3

Zero-Shot Image Classification. We further evaluate SOTAlign in zero-shot classification on ImageNet (Deng et al., 2009) and more fine-grained classification datasets. The results are displayed in Table 5 and mirror the trends observed in zero-shot retrieval. STRUCTURE outperforms SAIL in the supervised setting, but is not able to leverage unpaired data for additional performance gains.
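The Recall@1 numbers used throughout these comparisons follow the standard definition: a query counts as correct if its ground-truth match ranks first among all candidates. A minimal sketch with hypothetical embeddings (the helper name and toy data are ours):

```python
import numpy as np

def recall_at_k(sim, k=1):
    """sim[i, j]: similarity of query i to candidate j; pair (i, i) is ground truth."""
    # rank of the true match = number of candidates scoring strictly higher
    ranks = (sim > sim.diagonal()[:, None]).sum(axis=1)
    return float((ranks < k).mean())

rng = np.random.default_rng(0)
img = rng.normal(size=(50, 64))
txt = img + 0.1 * rng.normal(size=(50, 64))  # noisy "aligned" text embeddings
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

sim = img @ txt.T
i2t = recall_at_k(sim, k=1)    # image-to-text: rows are image queries
t2i = recall_at_k(sim.T, k=1)  # text-to-image: transpose swaps query/candidate
print((i2t + t2i) / 2)         # MeanR@1
```

With well-aligned toy embeddings both directions retrieve perfectly; transposing the similarity matrix is all that distinguishes T2I from I2T.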
Existing semi-supervised methods like NNCLR and S-CLIP do not show any improvements over the supervised baselines. Only SOTAlign is able to leverage 1M unpaired samples during alignment to improve zero-shot image classification. Our method achieves an ImageNet top-1 accuracy of 46.1%, an improvement of +7.9 over the best baseline. As an upper bound, we also report the supervised SAIL baseline trained on 1M image-text pairs.

Alignment per Dataset. Yacobi et al. (2025) study semi-supervised vision-language alignment using pretrained encoders, but under a substantially simpler setting than ours: their method operates within a single dataset, with paired and unpaired samples drawn from the same distribution, and evaluation restricted to retrieval on small test splits (400 samples). In contrast, our setting involves cross-dataset unpaired data and multiple downstream tasks. Nevertheless, when evaluated in their setting, SOTAlign consistently outperforms SUE (Yacobi et al., 2025) and its baselines, achieving gains of +14.3 I2T R@5 on COCO, +40.0 on Flickr30k, and +32.5 on Polyvore (see Table 6).

7. Conclusion

In this work, we introduced a semi-supervised setting for aligning pretrained unimodal encoders, which we believe is relevant to many real-world modalities where large-scale paired data are scarce. We argue that vision-language alignment provides an ideal testbed for this problem, as abundant paired data enable systematic exploration of different supervision regimes. To the best of our knowledge, SOTAlign is the first model that can effectively leverage large-scale unimodal data for multi-modal alignment in this setting. We hope that the simplicity of SOTAlign will inspire future work on multi-modal representation alignment beyond fully supervised regimes.
Acknowledgments

This work was partially funded by the ERC (853489 - DEXIM) and the Alfried Krupp von Bohlen und Halbach Foundation, which we thank for their generous support. This work was also supported by Hi! PARIS and the ANR/France 2030 program (ANR-23-IACL-0005) and by the French National Research Agency (ANR) through the France 2030 program under the MacLeOD project (ANR-25-PEIA-0005). Finally, it received funding from the Fondation de l'École polytechnique. We are grateful to Rémi Flamary for his review of the manuscript.

Impact Statement

This work aims to advance research in machine learning, particularly in the study of multimodal representation alignment. While improved alignment methods may have broad downstream applications, we do not identify any specific societal impacts that require explicit discussion here.

References

Adams, R. P. and Zemel, R. S. Ranking via sinkhorn propagation. arXiv preprint, 2011.

Assel, H. V. Inverse optimal transport does not require unrolling. April 2024. URL https://huguesva.github.io/blog/2024/inverseOT_mongegap/.

Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E. Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks. arXiv preprint arXiv:2511.07025, 2025.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In ECCV, pp. 446–461, 2014.

Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., et al. Symbolic discovery of optimization algorithms. NeurIPS, 36:49205–49233, 2023.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible scaling laws for contrastive language-image learning. In CVPR, pp. 2818–2829, 2023.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In CVPR, pp. 3606–3613, 2014.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. On kernel-target alignment. Advances in Neural Information Processing Systems, 14, 2001.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.

Cuturi, M., Teboul, O., Niles-Weed, J., and Vert, J.-P. Supervised quantile normalization for low rank matrix factorization. In International Conference on Machine Learning, pp. 2269–2279. PMLR, 2020.

Davari, M., Horoi, S., Natik, A., Lajoie, G., Wolf, G., and Belilovsky, E. Reliability of CKA as a similarity measure in deep learning. arXiv preprint, 2022.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.

Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., and Zisserman, A. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In CVPR, pp. 9588–9597, 2021.

Eisenberger, M., Toker, A., Leal-Taixé, L., Bernard, F., and Cremers, D. A unified framework for implicit sinkhorn differentiation. In CVPR, pp. 509–518, 2022.

Emami, P. and Ranka, S. Learning permutations with sinkhorn policy gradient. arXiv preprint arXiv:1805.07010, 2018.
Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur, A., Stap, D., Gala, J., Siblini, W., Krzemiński, D., Winata, G. I., Sturua, S., Utpala, S., Ciancone, M., Schaeffer, M., Sequeira, G., Misra, D., Dhakal, S., Rystrøm, J., Solomatin, R., Ömer Çağatan, Kundu, A., Bernstorff, M., Xiao, S., Sukhlecha, A., Pahwa, B., Poświata, R., GV, K. K., Ashraf, S., Auras, D., Plüster, B., Harries, J. P., Magne, L., Mohr, I., Hendriksen, M., Zhu, D., Gisserot-Boukhlef, H., Aarsen, T., Kostkan, J., Wojtasik, K., Lee, T., Šuppa, M., Zhang, C., Rocca, R., Hamdy, M., Michail, A., Yang, J., Faysse, M., Vatolin, A., Thakur, N., Dey, M., Vasani, D., Chitale, P., Tedeschi, S., Tai, N., Snegirev, A., Günther, M., Xia, M., Shi, W., Lù, X. H., Clive, J., Krishnakumar, G., Maksimova, A., Wehrli, S., Tikhonova, M., Panchal, H., Abramov, A., Ostendorff, M., Liu, Z., Clematide, S., Miranda, L. J., Fenogenova, A., Song, G., Safi, R. B., Li, W.-D., Borghini, A., Cassano, F., Su, H., Lin, J., Yen, H., Hansen, L., Hooker, S., Xiao, C., Adlakha, V., Weller, O., Reddy, S., and Muennighoff, N. Mmteb: Massive multilingual text embedding benchmark. arXiv preprint arXiv:2502.13595, 2025.

Flamary, R., Cuturi, M., Courty, N., and Rakotomamonjy, A. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.
Flamary, R., Vincent-Cuaz, C., Courty, N., Gramfort, A., Kachaiev, O., Quang Tran, H., David, L., Bonet, C., Cassereau, N., Gnassounou, T., Tanguy, E., Delon, J., Collas, A., Mazelet, S., Chapel, L., Kerdoncuff, T., Yu, X., Feickert, M., Krzakala, P., Liu, T., and Fernandes Montesuma, E. Pot python optimal transport (version 0.9.5), 2024. URL https://github.com/PythonOT/POT.

Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608–1617. PMLR, 2018.

Gower, J. C. and Dijksterhuis, G. B. Procrustes Problems, volume 30. Oxford University Press, 2004.

Gröger, F., Wen, S., Le, H., and Brbic, M. With limited data for multimodal alignment, let the STRUCTURE guide you. In NeurIPS, 2025.

Han, X., Wu, Z., Jiang, Y.-G., and Davis, L. S. Learning fashion compatibility with bidirectional lstms. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1078–1086. Association for Computing Machinery, 2017. ISBN 9781450349062.

Huang, W., Wu, A., Yang, Y., Luo, X., Yang, Y., Hu, L., Dai, Q., Wang, C., Dai, X., Chen, D., Luo, C., and Qiu, L. Llm2clip: Powerful language model unlocks richer visual representation. arXiv preprint arXiv:2411.04997, 2025.

Huh, M., Cheung, B., Wang, T., and Isola, P. Position: The platonic representation hypothesis. In ICML, pp. 20617–20642, 2024.

Jha, R., Zhang, C., Shmatikov, V., and Morris, J. X. Harnessing the universal geometry of embeddings. arXiv preprint arXiv:2505.12540, 2025.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pp. 4904–4916, 2021.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
Krzakala, P., Melo, G., Laclau, C., d'Alché-Buc, F., and Flamary, R. The quest for the graph level autoencoder (grale). arXiv preprint, 2025.

Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., and Ping, W. NV-Embed: Improved techniques for training LLMs as generalist embedding models. In ICLR, 2025.

Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., and Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In ICLR, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, pp. 740–755, 2014.

Liu, X., Bai, Y., Martín, R. D., Shi, K., Shahbazi, A., Landman, B. A., Chang, C., and Kolouri, S. Linear spherical sliced optimal transport: A fast metric for comparing spherical data. arXiv preprint, 2024.

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

Maniparambil, M., Akshulakov, R., Djilali, Y. A. D., El Amine Seddik, M., Narayan, S., Mangalam, K., and O'Connor, N. E. Do vision and language encoders represent the world similarly? In CVPR, pp. 14334–14343, 2024.

Maniparambil, M., Akshulakov, R., Djilali, Y. A. D., Narayan, S., Singh, A., and O'Connor, N. E. Harnessing frozen unimodal encoders for flexible multimodal alignment. In CVPR, pp. 29847–29857, 2025.

Mardia, K. V., Kent, J. T., and Taylor, C. C. Multivariate Analysis. John Wiley & Sons, 2024.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Mo, S., Kim, M., Lee, K., and Shin, J. S-CLIP: Semi-supervised vision-language learning using few specialist captions. In NeurIPS, 2023.

Nilsback, M.-E. and Zisserman, A.
Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, 2008.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

Peyré, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763, 2021.

Schnaus, D., Araslanov, N., and Cremers, D. It's a (blind) match! towards vision-language correspondence without parallel data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24983–24992, 2025.

Schönemann, P. H. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, 1966.

Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Gurevych, I. and Miyao, Y. (eds.), ACL, pp.
2556–2565, 2018.

Shi, L., Fan, J., and Yan, J. OT-CLIP: Understanding and generalizing CLIP via optimal transport. In ICML, 2024.

Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., and Bojanowski, P. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., and Zhai, X. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

Uscidda, T. and Cuturi, M. The monge gap: A regularizer to learn all transport maps. In International Conference on Machine Learning, pp. 34709–34733. PMLR, 2023.

Van Assel, H., Vayer, T., Flamary, R., and Courty, N. Snekhorn: Dimension reduction with symmetric entropic affinities. Advances in Neural Information Processing Systems, 36:44470–44487, 2023.

Vouitsis, N., Liu, Z., Gorti, S. K., Villecroze, V., Cresswell, J. C., Yu, G., Loaiza-Ganem, G., and Volkovs, M. Data-efficient multimodal fusion on a single gpu. In CVPR, pp. 27239–27251, 2024.

Yacobi, A., Ben-Ari, N., Talmon, R., and Shaham, U. Learning shared representations from unpaired data. In NeurIPS, 2025.

Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sigmoid loss for language image pre-training. In ICCV, pp. 11941–11952, 2023.

Zhang, L., Yang, Q., and Agrawal, A. Assessing and learning alignment of unimodal vision and language models. In CVPR, pp. 14604–14614, 2025a.

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., et al.
Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025b.

Zheng, K., Zhang, Y., Wu, W., Lu, F., Ma, S., Jin, X., Chen, W., and Shen, Y. Dreamlip: Language-image pre-training with long captions. In ECCV, 2024.

A. Experimental Setting

We outline the experimental setup in Section 6.3. Here, we provide further details on our implementation and baselines.

A.1. Implementation Details

Following Zhang et al. (2025a), we create global image representations by concatenating the [CLS] token with the mean of the remaining patch tokens. Text features are computed by averaging all tokens. We project both modalities into a shared embedding space of dimensionality d = 1024 using linear layers f and g. We found them to be more robust in our low-supervision regime compared to non-linear layers. When performing CCA, we add a regularization λ = 0.1 to the eigenvalues of matrices that need to be inverted. Our divergence, KLOT, is computed using the Sinkhorn algorithm with n = 100 iterations in both spaces. We set the entropic regularization to ε = 0.01 in the reference space and ε = 0.05 in the joint embedding space. Our experiments are conducted with 10k paired samples and, when applicable, up to 1M unpaired images and texts drawn from CC3M (Sharma et al., 2018). We train all models using a maximum batch size of 32k, composed of up to 10k paired samples and completed with unpaired images and text. If there are fewer than 32k total samples available in our robustness studies, we adjust the batch size accordingly. We use the LION optimizer (Chen et al., 2023) with a cosine annealing learning-rate schedule, a maximum learning rate of 10^-4, and a weight decay of 10^-5, and train for 2000 iterations. We mainly employ DINOv3 ViT-L (Siméoni et al.
, 2025) and NV-Embed-v2 (Lee et al., 2025) as the pretrained vision and language encoders, respectively. In Section 6.2, we additionally evaluate SOTAlign with DINOv2 ViT-L (Oquab et al., 2024), Qwen3-Embedding-8B (Zhang et al., 2025b), and Llama-Embed-Nemotron-8B (Babakhin et al., 2025). All of these language models are among the top performing models in the MMTEB benchmark (Enevoldsen et al., 2025). Our main evaluation metric is the average of the text-to-image (T2I) and image-to-text (I2T) retrieval (Recall@1) performance on the COCO validation set, which we denote MeanR@1. Whenever required, we use a similar score on the CC3M validation split for hyperparameter selection. All experiments can be run on a single A100 GPU with 80 GB memory.

A.2. Baselines

In Section 6.3, we compare SOTAlign against several supervised and semi-supervised baselines in zero-shot image classification and retrieval. For each baseline, we consider various configurations as detailed below, and report their optimal performance after hyperparameter tuning.

SAIL (Zhang et al., 2025a) performs contrastive learning of alignment layers with the SigLIP (Zhai et al., 2023) loss exclusively on paired data. This method represents a series of recent supervised contrastive methods for the alignment of pretrained unimodal vision and language models (Vouitsis et al., 2024; Maniparambil et al., 2025; Huang et al., 2025). Following Zhang et al. (2025a), we initialize the logit scale to 20 and the logit bias to -10 and allow both parameters to be trained. We examine the extension of SAIL to our semi-supervised setting by incorporating unpaired samples as additional negatives in the SigLIP loss.

STRUCTURE (Gröger et al.
, 2025) aligns pretrained encoders in low-resource regimes by augmenting the contrastive objective with an additional loss that forces the similarity distribution in the joint embedding space to lie between the unimodal similarity distributions. While STRUCTURE focuses on a fully supervised setting with paired data, we also evaluate the strength of its regularization term on unpaired data in our semi-supervised setting. We set the number of levels to 1 and the temperature in the softmax function to τ = 0.07. We tune the weight of the structure preservation term over λ ∈ {0.1, 1, 10, 100, 1000}, and consider both no warmup and a 500-step warmup schedule.

Further semi-supervised techniques can be borrowed from contrastive pretraining and low-resource domain adaptation to construct pseudo-pairs based on similarity distributions in the unimodal or joint embedding spaces. NNCLR (Dwibedi et al., 2021) enriches contrastive learning by retrieving the nearest neighbors of an instance and using them as additional positives. DeCLIP (Li et al., 2022) has adopted such nearest-neighbor supervision in image-language pre-training. We follow this line of work and utilize unpaired images and text as augmentations for the few paired samples. Specifically, for a given image, we find the closest neighbor of its paired caption in the unimodal language space, which then serves as an additional positive for the image. Similarly, for a given text, we find the closest neighbor of its paired image in the unimodal vision space, and use it as an additional positive for the text. While NNCLR is often implemented with a queue containing the last few batches during training, we can compute the nearest neighbors for all CC3M samples in the unimodal spaces a priori since we utilize pretrained encoders.
When training the alignment layers, we then randomly sample a nearest neighbor from the top k ∈ {1, 5, 10} neighbors, and further perform a hyperparameter search for the weights of the contrastive losses with the additional positives: w_img, w_text ∈ {0, 0.1, 1, 10}.

S-CLIP (Mo et al., 2023) addresses the domain adaptation of CLIP with pseudo-labeling at the caption and keyword level. We evaluate their caption-level supervision in our setting. Given an unpaired image, S-CLIP computes similarity scores to paired images, and then uses the resulting similarity distribution to determine pseudo-positives from the paired text. The pseudo-positives can be chosen in a hard assignment as the single nearest neighbor (argmax of the distribution, similar to NNCLR) or in a soft assignment as a weighted average of representations. A key component of S-CLIP is its use of OT to find the optimal matching between unpaired and paired images. The method is limited by the small pool of positives. We apply S-CLIP in the unimodal vision and language spaces as well as in the joint embedding space. We search for pseudo-labels for both unpaired images and unpaired text and tune their corresponding weights in the final objective via a grid search: w_img, w_text ∈ {0, 0.1, 1, 10}.

SUE (Yacobi et al., 2025) studies the alignment of unimodal encoders on a single image-text dataset. Their approach combines learnable spectral embeddings on unpaired data with CCA on paired data for linear alignment, and a residual network to further refine the alignment.

A.3. Datasets

We construct our semi-supervised setting primarily using CC3M (Sharma et al., 2018) with both raw web captions and synthetic captions generated by DreamLIP (Zheng et al., 2024). We further experiment with disjoint images and texts from CC12M (Changpinyo et al., 2021), COCO (Lin et al., 2014), ImageNet (Deng et al., 2009), and WikiText (Merity et al., 2016).
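Because the encoders are frozen, the nearest-neighbor lookup described for the NNCLR baseline in Appendix A.2 can be done once, before training. A minimal sketch of the precompute-then-sample scheme; the variable names and toy data are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
paired_txt = rng.normal(size=(100, 64))     # frozen embeddings of paired captions
unpaired_txt = rng.normal(size=(1000, 64))  # frozen embeddings of the unpaired pool

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# precompute once: top-k unpaired neighbors of each paired caption (cosine)
k = 5
sim = normalize(paired_txt) @ normalize(unpaired_txt).T  # (100, 1000)
topk = np.argsort(-sim, axis=1)[:, :k]                   # (100, k)

def sample_positive(i, rng):
    """During training: draw one of the precomputed top-k neighbors of pair i
    to serve as an extra positive in the contrastive loss."""
    return topk[i, rng.integers(k)]

extra_pos = sample_positive(3, rng)  # index into the unpaired pool
```

The same lookup table is reused at every training step, which is what makes the queue-free variant mentioned above feasible.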
We select models based on their average text-to-image (T2I) and image-to-text (I2T) retrieval performance on the CC3M validation set. In Section 6.3, we evaluate our model in a zero-shot setting across a diverse suite of classification and retrieval benchmarks.

• Classification: ImageNet (Deng et al., 2009), Food-101 (Bossard et al., 2014), CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Aircraft (Maji et al., 2013), DTD (Cimpoi et al., 2014), Flowers (Nilsback & Zisserman, 2008)

• Retrieval (T2I, I2T): COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2015)

Figure 6. Comparison of memory usage, runtime, and number of Sinkhorn iterations for different gradient computation strategies. We run as many Sinkhorn iterations as required to achieve marginal convergence within a tolerance of 10^-6.

B. Additional Experiments

B.1. Sinkhorn Backpropagation

The Sinkhorn algorithm has recently been used as a differentiable layer in a wide range of applications, including reinforcement learning (Emami & Ranka, 2018), learning to rank (Adams & Zemel, 2011), discriminant analysis (Flamary et al., 2018), graph matching (Krzakala et al., 2025), and representation learning (Van Assel et al., 2023). Closer to our setting, several recent works have applied Sinkhorn-based objectives to contrastive learning for vision-language models (Mo et al., 2023; Shi et al., 2024). While the Sinkhorn algorithm (defined in Appendix C.2) is differentiable in theory, computing its gradient in practice is challenging. The most common approach consists in unrolling the Sinkhorn iterations and directly backpropagating through the solver.
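The iteration that such unrolling would backpropagate through is simple; a minimal numpy sketch with uniform marginals (toy data and the affinity convention are ours, and a log-domain implementation would be preferred for small ε):

```python
import numpy as np

def sinkhorn(K, eps=0.05, n_iter=100):
    """Entropic OT plan between uniform marginals for an n x m affinity matrix K.

    In an autograd framework, differentiating through this loop ("unrolling")
    must retain all n_iter intermediate scalings in the computational graph.
    """
    n, m = K.shape
    M = np.exp(K / eps)  # Gibbs kernel (K treated as an affinity to maximize)
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iter):
        u = (1 / n) / (M @ v)    # enforce row marginals
        v = (1 / m) / (M.T @ u)  # enforce column marginals
    return u[:, None] * M * v[None, :]  # transport plan

rng = np.random.default_rng(0)
K = 0.1 * rng.normal(size=(8, 8))
P = sinkhorn(K)
print(P.sum(axis=0))  # 1/8 each: column marginals hold exactly after the final v-update
print(np.abs(P.sum(axis=1) - 1 / 8).max())  # row marginal error, small after 100 iterations
```

Each loop body is cheap, but storing its intermediates for reverse-mode differentiation is what causes the memory growth discussed next.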
However, this strategy incurs a large memory overhead, as the full computational graph must be retained for all iterations, causing memory consumption to grow rapidly with the number of Sinkhorn steps. An alternative is to rely on implicit differentiation, which amounts to solving the linear system defined by the optimality conditions of the entropic OT problem. In the case of Sinkhorn, this system exhibits a particular structure that enables more efficient solvers (Cuturi et al., 2020; Eisenberger et al., 2022). While this approach alleviates the memory explosion associated with unrolling, it remains computationally expensive in practice. In the context of our proposed divergence,

    KLOT(K ∥ K*) = KL(OT_{ε*}(K*) ∥ OT_ε(K)),    (15)

a naive application of the chain rule would suggest that computing the gradient ∇_K KLOT(K ∥ K*) requires explicitly forming the Jacobian ∂OT_ε(K)/∂K, thereby necessitating either Sinkhorn unrolling or implicit differentiation. Crucially, Theorem 5.1 shows that this is not required. Instead, the gradient of KLOT admits a closed-form expression that can be computed directly, without evaluating the Jacobian of the Sinkhorn operator. To empirically illustrate the efficiency of this result, we extract an n × n affinity matrix with n = 10k from a checkpoint of SAIL training and compare three strategies for computing ∇_K KLOT(K ∥ K*): Sinkhorn unrolling, implicit differentiation, and our closed-form gradient. The results are reported in Figure 6. Our approach significantly outperforms both alternatives in terms of memory usage and runtime. In particular, depending on the value of ε, which controls the number of Sinkhorn iterations (with convergence scaling as O(1/ε²)), the proposed method can be up to 100× more memory efficient than unrolling and up to 50× faster than implicit differentiation.
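The forward evaluation of Equation (15) itself only requires running Sinkhorn in each space and comparing the two plans with a KL divergence. A self-contained numpy sketch of that forward pass (the closed-form gradient of Theorem 5.1 is not reproduced here, and the toy affinities are ours):

```python
import numpy as np

def sinkhorn_plan(K, eps, n_iter=200):
    """Entropic OT plan OT_eps(K) for an affinity matrix K, uniform marginals."""
    n, m = K.shape
    M = np.exp(K / eps)
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iter):
        u = (1 / n) / (M @ v)
        v = (1 / m) / (M.T @ u)
    return u[:, None] * M * v[None, :]

def klot(K, K_ref, eps=0.05, eps_ref=0.01):
    """KLOT(K || K*) = KL(OT_{eps*}(K*) || OT_eps(K)), cf. Eq. (15)."""
    P_ref = sinkhorn_plan(K_ref, eps_ref)
    P = sinkhorn_plan(K, eps)
    return float((P_ref * np.log(P_ref / P)).sum())

rng = np.random.default_rng(0)
K_ref = 0.1 * rng.normal(size=(16, 16))
print(klot(K_ref, K_ref, eps=0.01, eps_ref=0.01))  # 0: identical plans
print(klot(K_ref + 0.05 * rng.normal(size=(16, 16)), K_ref))  # > 0: plans differ
```

Since both plans are strictly positive joint distributions over the same support, the KL term is finite and non-negative, vanishing only when the two plans coincide.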
Figure 7. Effect of the number of paired samples during alignment on downstream zero-shot classification and retrieval. We fix 1M unpaired samples from CC3M and vary the number of paired samples. (a) ImageNet classification; (b) COCO T2I retrieval; (c) COCO I2T retrieval.

B.2. Robustness of SOTAlign

In Section 6.2, we analyze the robustness of SOTAlign to variations in both the amount and the source of supervised and unsupervised data. Here, we report the full set of results. Figure 7 shows how zero-shot classification and retrieval performance vary as a function of the number of paired samples used during alignment. Figure 8 illustrates the effect of increasing the number of unpaired samples, in comparison to the supervised SAIL baseline. Table 8 reports zero-shot classification and retrieval results for different combinations of unimodal vision and language encoders, along with absolute gains over supervised SAIL. Finally, Figure 9 relates these performances to the mutual k-NN similarity between encoder pairs, revealing a strong correlation (r = 0.83), although additional data points would be required to draw firm conclusions.

B.3. Benchmarking Semi-Supervised Alignment

Table 9 reports retrieval performance on COCO and Flickr30k, while Table 10 presents zero-shot image classification accuracy across a variety of downstream datasets. In Section 6.3, we evaluate SOTAlign in a semi-supervised alignment setting proposed by Yacobi et al. (2025).
We train for alignment on a single dataset and evaluate retrieval on the test split of the same dataset (with only 400 test instances for retrieval). The datasets are COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2015), and Polyvore (Han et al., 2017). Yacobi et al. (2025) use MLPs for alignment and an embedding dimensionality of 8. In Table 11, we report the performance of SOTAlign adhering to their architectural choices. Our method achieves gains of +5.5 on COCO, +28.7 on Flickr30k, and +18.2 on Polyvore I2T R@5. However, if we lift these constraints and instead use linear alignment layers with a target dimension of 512, performance increases further, reaching +14.3 on COCO, +40.0 on Flickr30k, and +32.5 on Polyvore.

B.4. Quantifying the Distribution Shift

We study the effect of the distribution shift arising from the use of unpaired data $(X, Y)$ drawn from different sources than the paired data $(A, B)$, using the total spherical sliced Wasserstein distance (see Appendix C.2) computed with the Python library POT (Flamary et al., 2021; 2024). In all experiments, we set $p = 2$ and use $N = 500$ projection directions. Distances between unimodal datasets are reported in Figure 12, averaged over 20 random seeds, each corresponding to a different subset of 100,000 samples of the dataset and a different projection set. All distances are computed between embeddings from DINOv3 and NV-Embed-v2 for images and text, respectively. In Figure 5, we exhibit a strong correlation between distance and performance; the distance therefore provides a good proxy for performance that can be computed without any training or inference. This correlation is even stronger when one of the unimodal unpaired datasets is fixed to CC3M and the other is varied: we show that performance is strongly correlated with the SSW distances between the unimodal datasets, $\mathrm{SSW}(B, Y)$ and $\mathrm{SSW}(A, X)$, in Figures 10 and 11, respectively.
Figure 8. Effect of the number of unpaired samples (10k to 1M) during alignment on downstream zero-shot classification and retrieval: (a) ImageNet classification (top-1 accuracy), (b) COCO T2I (R@1), (c) COCO I2T (R@1). We fix 10k paired samples from CC3M and vary the number of unpaired samples.

Figure 9. Mean R@1 on COCO vs. mutual k-NN similarity across vision models (DINOv3 ViT-L, DINOv2 ViT-L) and language models (NV-Embed-v2, Qwen3-Embedding-8B, Llama-Embed-Nemotron-8B); Pearson r = 0.83.

Figure 10. Performance (COCO Mean R@1, Pearson r = −0.89; ImageNet-1K top-1 accuracy, Pearson r = −0.96) when using CC3M as paired data, CC3M images as unpaired images, and other text datasets (WikiText103, CC12M, CC3M, CC3M-s) as unpaired text, together with a comparison to the spherical sliced Wasserstein distance between CC3M text and the other text datasets.

Table 7. We train SOTAlign on 10k image-text pairs from CC3M and up to 1M samples from varying unimodal datasets and report zero-shot classification (top-1 accuracy) and retrieval (R@1). For comparison, we report SAIL trained with as many samples and SAIL 1M, i.e. the version trained using 1M supervised samples from CC3M (100× more supervision).

Method      Unpaired Images   Unpaired Text   ImageNet-1K   COCO T2I   COCO I2T
SAIL 1M     —                 —               56.4          35.5       45.5
SAIL        —                 —               35.6          21.0       27.4
SOTAlign    CC3M              CC3M            46.1          26.5       34.1
SOTAlign    CC3M              CC3M synth.     46.2          30.4       39.7
SOTAlign    CC3M              CC12M           44.6          24.5       32.0
SOTAlign    CC12M             CC3M            43.8          24.9       32.8
SOTAlign    CC12M             CC12M           46.8          25.7       34.4
SOTAlign    ImageNet          CC3M            43.4          23.8       31.5
SOTAlign    ImageNet          CC12M           44.3          24.2       31.5
SOTAlign    CC3M              COCO            38.4          28.4       26.7
SOTAlign    CC12M             COCO            38.3          27.7       30.6
SOTAlign    ImageNet          COCO            40.6          25.5       26.1
SOTAlign    COCO              CC3M            38.0          21.7       33.9
SOTAlign    COCO              CC12M           38.5          22.0       33.1
SOTAlign    CC3M              WikiText103     39.8          21.4       27.8
SOTAlign    COCO              WikiText103     37.1          19.5       29.4
SOTAlign    ImageNet          WikiText103     40.7          20.7       28.1

Table 8. SOTAlign trained on 10k pairs and 1M unpaired samples from CC3M with different unimodal encoders. Zero-shot classification (top-1 accuracy) and retrieval (R@1). Values in parentheses give the absolute gain over supervised SAIL.

Vision Model   Language Model   Mutual k-NN   Method     ImageNet-1K    COCO T2I      COCO I2T
DINOv2         Nemotron-8B      14.6          SAIL       25.5           11.7          18.0
                                              SOTAlign   32.4 (+6.9)    15.5 (+3.8)   23.3 (+5.3)
DINOv2         Qwen3-8B         18.9          SAIL       31.8           16.8          23.8
                                              SOTAlign   39.5 (+7.7)    20.9 (+4.1)   31.1 (+7.3)
DINOv2         NV-Embed-v2      18.2          SAIL       32.7           19.0          23.4
                                              SOTAlign   42.5 (+9.8)    23.1 (+4.1)   31.1 (+7.7)
DINOv3         Nemotron-8B      14.1          SAIL       25.4           12.8          21.0
                                              SOTAlign   35.5 (+10.1)   16.6 (+3.8)   26.2 (+5.2)
DINOv3         Qwen3-8B         18.0          SAIL       33.4           19.1          28.0
                                              SOTAlign   42.7 (+9.3)    24.1 (+5.0)   35.3 (+7.3)
DINOv3         NV-Embed-v2      17.6          SAIL       35.6           21.0          27.4
                                              SOTAlign   46.1 (+10.5)   26.5 (+5.5)   34.1 (+6.7)

Table 9. Zero-shot text-image retrieval (Recall@K) on COCO and Flickr30k. Comparison of supervised and semi-supervised methods trained with 10k image-text pairs and 1M unpaired samples from CC3M. The upper bound with 1M supervised pairs is shown in the first row.

                           COCO T2I        COCO I2T        Flickr30k T2I   Flickr30k I2T
Method                     R@1    R@5      R@1    R@5      R@1    R@5      R@1    R@5
Supervised
  SAIL (1M)                35.5   61.0     45.5   72.0     63.1   87.2     75.0   94.4
  SAIL                     21.0   44.3     27.4   51.7     45.7   75.1     54.1   81.2
  STRUCTURE                21.0   43.7     28.7   52.7     46.8   74.9     54.0   82.9
Semi-supervised
  SAIL                     20.7   43.6     26.5   51.7     44.9   74.2     53.1   82.1
  STRUCTURE                20.9   43.5     28.0   52.2     45.7   74.9     56.0   83.3
  NNCLR                    21.3   44.4     27.9   52.2     46.6   75.3     52.9   82.1
  S-CLIP                   20.4   42.6     27.8   50.5     44.5   74.4     52.6   82.3
  SOTAlign (Ours)          26.5   49.8     34.1   59.4     51.7   79.2     60.8   85.7

Table 10. Zero-shot image classification (top-1 accuracy). Comparison of supervised and semi-supervised methods trained with 10k image-text pairs and 1M unpaired samples from CC3M. The upper bound with 1M supervised pairs is shown in the first row.

Method               Food-101   CIFAR-10   CIFAR-100   Aircraft   DTD    Flowers   ImageNet
Supervised
  SAIL (1M)          63.9       97.8       82.3        9.7        53.5   47.2      56.4
  SAIL               36.4       96.2       71.2        3.9        36.8   24.1      35.6
  STRUCTURE          38.5       96.7       72.2        5.4        39.5   23.6      38.2
Semi-supervised
  SAIL               36.5       96.2       71.3        3.8        35.9   21.1      35.6
  STRUCTURE          37.6       96.5       70.8        4.9        38.7   23.2      36.8
  NNCLR              37.9       96.5       73.0        3.8        38.8   24.6      37.4
  S-CLIP             35.3       95.9       69.3        4.4        37.6   22.5      36.4
  SOTAlign (Ours)    50.0       97.5       78.3        5.0        42.4   30.1      46.1

Table 11. Alignment per dataset. Following the setup of SUE (Yacobi et al., 2025), we train for alignment per dataset and evaluate image-text retrieval (Recall@5). We report SOTAlign results adhering to the architectural choices of SUE (MLP, embedding dimensionality of 8) and without these constraints.

                                     COCO (100 Pairs)   Flickr30k (500 Pairs)   Polyvore (500 Pairs)
Method                               I2T     T2I        I2T     T2I             I2T     T2I
CSA                                  1.3     1.0        1.3     0.8             1.3     1.0
Contrastive                          8.5     5.8        9.5     9.8             13.8    11.5
SUE                                  21.5    18.3       19.8    22.0            22.8    20.8
SOTAlign (with SUE constraints)      27.0    28.8       48.5    48.8            41.0    39.8
SOTAlign (without SUE constraints)   35.8    35.0       59.8    63.3            55.3    55.3

Figure 11. Performance (COCO Mean R@1, Pearson r = −0.99; ImageNet-1K top-1 accuracy, Pearson r = −0.97) when using CC3M as paired data, CC3M text as unpaired text, and other image datasets (CC12M, COCO, ImageNet) as unpaired images, together with a comparison to the spherical sliced Wasserstein distance between CC3M images and the other image datasets.

Figure 12. Spherical sliced Wasserstein distances between different image datasets (left) and text datasets (right). We report the mean and standard deviation of the distances over 20 seeds.

Image datasets:
             CC3M           CC12M          COCO           Imagenet
CC3M         0.001±0.0000   0.009±0.0002   0.018±0.0004   0.021±0.0004
CC12M        0.009±0.0002   0.001±0.0000   0.018±0.0003   0.021±0.0003
COCO         0.018±0.0004   0.018±0.0003   0.001±0.0001   0.016±0.0003
Imagenet     0.021±0.0004   0.021±0.0003   0.016±0.0003   0.001±0.0000

Text datasets:
             CC3M           CC3M-s         CC12M          COCO           WikiText103
CC3M         0.001±0.0000   0.001±0.0000   0.017±0.0003   0.033±0.0006   0.026±0.0005
CC3M-s       0.001±0.0000   0.001±0.0000   0.017±0.0003   0.033±0.0006   0.026±0.0005
CC12M        0.017±0.0003   0.017±0.0003   0.001±0.0000   0.037±0.0009   0.028±0.0005
COCO         0.033±0.0006   0.033±0.0006   0.037±0.0009   0.002±0.0001   0.037±0.0005
WikiText103  0.026±0.0005   0.026±0.0005   0.028±0.0005   0.037±0.0005   0.004±0.0002
C. Mathematical Details

C.1. Linear Alignment Models

We now provide the closed-form solutions for the proposed linear alignment models.

Procrustes. The classical orthogonal Procrustes problem is defined for two point clouds $A, B \in \mathbb{R}^{n \times d}$ and seeks an orthogonal transformation that best aligns $A$ to $B$. It can be written as

$$\max_{P \in \mathbb{R}^{d \times d}} \; \langle AP^\top, B \rangle \quad \text{s.t.} \quad PP^\top = I_d. \tag{16}$$

This formulation learns a single linear mapping from $A$ to $B$ and implicitly assumes that both point clouds lie in the same ambient space $\mathbb{R}^d$. However, Procrustes alignment is known to admit flexible generalizations beyond this setting (Gower & Dijksterhuis, 2004). In particular, it can be extended to handle representations of different dimensionalities and to learn projections into a shared lower-dimensional space. We now introduce a natural variant of Procrustes alignment that is better suited to our setting.

Proposition C.1 (Closed-form solution of Procrustes alignment). Let $A \in \mathbb{R}^{n \times d_a}$ and $B \in \mathbb{R}^{n \times d_b}$, and let $d' \le \min\{d_a, d_b\}$. Consider the optimization problem

$$\max_{P \in \mathbb{R}^{d' \times d_a},\, Q \in \mathbb{R}^{d' \times d_b}} \; \langle AP^\top, BQ^\top \rangle \quad \text{s.t.} \quad PP^\top = I_{d'}, \; QQ^\top = I_{d'}. \tag{17}$$

Let the singular value decomposition of $A^\top B$ be $A^\top B = U \Sigma V^\top$, with singular values in non-increasing order. Then an optimal solution is given by $W_x = U_{:,1:d'}^\top$, $W_y = V_{:,1:d'}^\top$.

Proof. We introduce the change of variables $\tilde{P} = PU$, $\tilde{Q} = QV$. Since $U$ and $V$ are orthogonal, $\tilde{P}$ and $\tilde{Q}$ also satisfy $\tilde{P}\tilde{P}^\top = \tilde{Q}\tilde{Q}^\top = I_{d'}$. Using invariance of the Frobenius inner product under orthogonal transformations, the objective rewrites as

$$\langle AP^\top, BQ^\top \rangle = \langle \tilde{P}\Sigma, \tilde{Q} \rangle.$$

By the Cauchy–Schwarz inequality, $\langle \tilde{P}\Sigma, \tilde{Q} \rangle \le \|\tilde{P}\Sigma\|_F \|\tilde{Q}\|_F$. Since $\tilde{Q} \in \mathbb{R}^{d' \times d_b}$ has orthonormal rows, we have $\|\tilde{Q}\|_F^2 = \mathrm{tr}(\tilde{Q}\tilde{Q}^\top) = d'$. We now bound $\|\tilde{P}\Sigma\|_F^2$.
By definition,

$$\|\tilde{P}\Sigma\|_F^2 = \mathrm{tr}(\tilde{P}\Sigma^2\tilde{P}^\top) = \mathrm{tr}(\Sigma^2 \tilde{P}^\top\tilde{P}),$$

where we used cyclic invariance of the trace. Since $\tilde{P}$ has orthonormal rows, the matrix $\Pi = \tilde{P}^\top\tilde{P}$ is an orthogonal projector of rank $d'$, with eigenvalues in $\{0, 1\}$ and $\mathrm{tr}(\Pi) = d'$. Let $\Sigma^2 = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0$. Then

$$\mathrm{tr}(\Sigma^2\Pi) = \sum_{i=1}^{r} \sigma_i^2\, \Pi_{ii}.$$

Because $0 \le \Pi_{ii} \le 1$ for all $i$ and $\sum_i \Pi_{ii} = d'$, the sum is maximized by assigning weight 1 to the $d'$ largest diagonal entries of $\Sigma^2$. Therefore,

$$\mathrm{tr}(\Sigma^2\Pi) \le \sum_{i=1}^{d'} \sigma_i^2.$$

Combining the above bounds yields

$$\langle \tilde{P}\Sigma, \tilde{Q} \rangle \le \sqrt{d'} \left( \sum_{i=1}^{d'} \sigma_i^2 \right)^{1/2},$$

and the bound is tight when $\tilde{P}$ and $\tilde{Q}$ select the first $d'$ coordinates, which concludes the proof.

Canonical Correlation Analysis (CCA). Canonical Correlation Analysis (CCA) is a classical tool for studying linear relationships between two sets of variables and is widely used in multivariate statistics and representation learning. In this work, CCA is already defined in (7); we briefly recall its formulation here in a form that is convenient for deriving its closed-form solution and for highlighting its connection to Procrustes alignment. Denoting $\Sigma_{x,x} = A^\top A$, $\Sigma_{x,y} = A^\top B$, $\Sigma_{y,y} = B^\top B$, the CCA problem (7) can be equivalently rewritten as

$$(W_x, W_y) = \arg\max_{P, Q} \; \langle P\Sigma_{x,y}, Q \rangle \quad \text{s.t.} \quad P\Sigma_{x,x}P^\top = I_{d'}, \; Q\Sigma_{y,y}Q^\top = I_{d'}. \tag{18}$$

We now present a standard derivation of the closed-form solution, included for completeness, which makes explicit the relationship between CCA and the Procrustes problem introduced above.

Proposition C.2 (Closed-form solution of CCA). Let $\Sigma_{x,x}^{-1/2}\Sigma_{x,y}\Sigma_{y,y}^{-1/2} = U\Sigma V^\top$ be the singular value decomposition, with singular values in non-increasing order. Then an optimal solution to (18) is given by

$$W_x = U_{:,1:d'}^\top \Sigma_{x,x}^{-1/2}, \qquad W_y = V_{:,1:d'}^\top \Sigma_{y,y}^{-1/2}.$$

Proof.
We introduce the change of variables $\tilde{P} = P\Sigma_{x,x}^{1/2}$, $\tilde{Q} = Q\Sigma_{y,y}^{1/2}$. Under this transformation, the constraints become $\tilde{P}\tilde{P}^\top = I_{d'}$, $\tilde{Q}\tilde{Q}^\top = I_{d'}$, and the objective rewrites as

$$\langle P\Sigma_{x,y}, Q \rangle = \left\langle \tilde{P}\, \Sigma_{x,x}^{-1/2}\Sigma_{x,y}\Sigma_{y,y}^{-1/2}, \tilde{Q} \right\rangle.$$

Thus, the CCA problem reduces to an orthogonal Procrustes problem:

$$\max_{\tilde{P}\tilde{P}^\top = \tilde{Q}\tilde{Q}^\top = I_{d'}} \left\langle \tilde{P}M, \tilde{Q} \right\rangle, \quad \text{where } M = \Sigma_{x,x}^{-1/2}\Sigma_{x,y}\Sigma_{y,y}^{-1/2}.$$

Let $M = U\Sigma V^\top$ be its singular value decomposition. By the Procrustes result, the maximum is attained for $\tilde{P} = U_{:,1:d'}^\top$, $\tilde{Q} = V_{:,1:d'}^\top$. Substituting back yields

$$P = U_{:,1:d'}^\top \Sigma_{x,x}^{-1/2}, \qquad Q = V_{:,1:d'}^\top \Sigma_{y,y}^{-1/2},$$

which concludes the proof.

C.2. Optimal Transport

Introduction to optimal transport. We briefly recall the discrete optimal transport (OT) problem and its entropic relaxation; we refer to Peyré et al. (2019) for more details. Let $n \in \mathbb{N}$ and denote by $\mathcal{P}_n$ the set of permutation matrices,

$$\mathcal{P}_n = \{ P \in \{0,1\}^{n \times n} \mid P\mathbf{1} = \mathbf{1},\; P^\top\mathbf{1} = \mathbf{1} \}, \tag{Permutation matrices}$$

and by $\Pi_n$ the set of bistochastic matrices,

$$\Pi_n = \{ T \in \mathbb{R}_+^{n \times n} \mid T\mathbf{1} = \mathbf{1},\; T^\top\mathbf{1} = \mathbf{1} \}. \tag{Transport plans}$$

We further define the (negative) entropy of a transport plan $T \in \Pi_n$ as

$$H(T) = \sum_{i,j} T_{ij} \log T_{ij}. \tag{Negative entropy}$$

In the discrete OT setting, we are given two sets of points indexed by $i, j \in \{1, \ldots, n\}$ and a cost matrix $C \in \mathbb{R}^{n \times n}$, where $C_{ij}$ denotes the cost of transporting mass from point $i$ to point $j$. The classical Monge formulation seeks the permutation minimizing the total transport cost,

$$\min_{P \in \mathcal{P}_n} \langle P, C \rangle. \tag{Monge}$$

This formulation enforces a one-to-one matching and is combinatorial in nature.
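To make the combinatorial nature of the Monge problem concrete, the following small sketch (illustrative only, not from the paper) compares a brute-force minimum over all $n!$ permutation matrices with scipy's Hungarian solver, which finds the same optimal assignment in polynomial time:

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 5
C = rng.random((n, n))                       # cost matrix C_ij

# Monge: brute-force minimum over all n! permutations
brute = min(sum(C[i, p[i]] for i in range(n))
            for p in itertools.permutations(range(n)))

# Hungarian algorithm solves the same assignment problem in O(n^3)
rows, cols = linear_sum_assignment(C)
hungarian = C[rows, cols].sum()

print(abs(brute - hungarian) < 1e-9)         # → True: the two optima coincide
```

For $n = 20$ the brute-force search would already require $20! \approx 2.4 \times 10^{18}$ evaluations, which is why the convex Kantorovich relaxation and its entropic variant are used in practice.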
Kantorovich proposed a convex relaxation of this problem by allowing fractional transport plans,

$$\min_{T \in \Pi_n} \langle T, C \rangle, \tag{Kantorovich}$$

which can be shown to be equivalent to the Monge formulation in the discrete balanced setting (Peyré et al., 2019), while being more flexible and amenable to generalizations such as non-uniform marginals and continuous measures. When the cost matrix is defined as $C_{i,j} = d(x_i, y_j)^p$, where $d$ is a distance on the underlying space and $p \ge 1$, the optimal value of the Kantorovich problem induces the $p$-Wasserstein distance,

$$W_p = \left( \min_{T \in \Pi_n} \langle T, C \rangle \right)^{1/p}. \tag{Wasserstein distance}$$

To further improve computational tractability, Cuturi (2013) introduced the entropic regularized OT problem, also known as the Sinkhorn relaxation,

$$\min_{T \in \Pi_n} \langle T, C \rangle + \varepsilon H(T), \tag{19}$$

where $\varepsilon > 0$ controls the strength of the regularization. This formulation yields a strictly convex objective and can be efficiently solved with the Sinkhorn algorithm.

Sliced Wasserstein distance. Although entropically regularized optimal transport can be solved efficiently with the Sinkhorn algorithm, its computational complexity remains $O(n^2)$, which becomes prohibitive when comparing distributions supported on millions of high-dimensional points. To address this limitation, the sliced Wasserstein (SW) distance was introduced (Bonneel et al., 2015). The key observation underlying this approach is that the Wasserstein distance between one-dimensional distributions admits a closed-form solution obtained by sorting the samples and matching them monotonically. The sliced Wasserstein distance exploits this property by projecting high-dimensional distributions onto multiple one-dimensional subspaces and averaging the resulting one-dimensional Wasserstein distances.
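The project-sort-average recipe described above is straightforward to sketch in numpy. This is a minimal linear-slicing version assuming uniform weights and equal sample sizes; the experiments in this paper instead use the spherical variant from the POT library:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=500, p=2, seed=0):
    """Monte-Carlo sliced p-Wasserstein distance between two point clouds of
    equal size with uniform weights: project onto random directions on the
    sphere, solve each 1-D OT problem by sorting, then average."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # directions on S^{d-1}
    Xp = np.sort(X @ theta.T, axis=0)  # sorted 1-D projections, one column per direction
    Yp = np.sort(Y @ theta.T, axis=0)
    return float(np.mean(np.abs(Xp - Yp) ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 16))
B = rng.normal(loc=0.5, size=(200, 16))
print(sliced_wasserstein(A, A))  # identical clouds → 0.0
print(sliced_wasserstein(A, B))  # shifted clouds → strictly positive
```

The cost per projection is dominated by sorting, so the total complexity is $O(N\, n \log n)$ instead of the $O(n^2)$ of Sinkhorn, which is what makes the sliced variant tractable at the million-sample scale used in Appendix B.4.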
Let $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ be two discrete probability measures with $x_i, y_j \in \mathbb{R}^d$. For a given projection $\theta \in \mathbb{S}^{d-1}$, we define the projected one-dimensional distributions as

$$\mu_\theta = \sum_{i=1}^n a_i \delta_{\langle x_i, \theta \rangle}, \qquad \nu_\theta = \sum_{j=1}^m b_j \delta_{\langle y_j, \theta \rangle}. \tag{20}$$

Given a set of projection directions $(\theta_1, \ldots, \theta_N)$, the $p$-SW distance is defined as

$$SW_p(\mu, \nu) = \left( \frac{1}{N} \sum_{i=1}^{N} W_p^p(\mu_{\theta_i}, \nu_{\theta_i}) \right)^{1/p}. \tag{21}$$

When the data are constrained to the unit sphere, the spherical sliced Wasserstein (SSW) distance (Liu et al., 2024) replaces linear projections with angular projections and computes optimal transport on the circle, thereby respecting the intrinsic geometry of directional data.

Total sliced Wasserstein distance. We introduce the total spherical sliced Wasserstein distance $d$ as a measure of the distribution shift between a source of unpaired data $\mathcal{D} = (X, Y)$ and the paired dataset $\mathcal{D}_p = (A, B)$. Using the spherical sliced Wasserstein distance in place of the standard sliced Wasserstein distance ensures that the resulting distances between text distributions and between image distributions are computed on a comparable scale, enabling fair comparison across modalities. The total SSW distance is defined as

$$d(\mathcal{D}, \mathcal{D}_p) = \mathrm{SSW}(X, A) + \mathrm{SSW}(Y, B). \tag{22}$$

Theoretical results. We now present the theoretical contribution underlying our proposed divergence and its efficient differentiation. Throughout, we work with an affinity matrix $K \in \mathbb{R}^{n \times n}$ rather than a cost matrix, following the convention $C = -K$ for consistency with the rest of the paper. For any transport plan $T \in \Pi_n$, we define the entropic OT objective

$$\mathcal{W}_\epsilon(T, K) = -\langle T, K \rangle + \epsilon H(T), \tag{23}$$

and the associated optimal value

$$\mathcal{W}_\epsilon(K) = \min_{T \in \Pi_n} \mathcal{W}_\epsilon(T, K). \tag{24}$$

Finally, we denote by

$$\mathrm{OT}_\epsilon(K) = \arg\min_{T \in \Pi_n} \mathcal{W}_\epsilon(T, K) \tag{25}$$

the corresponding optimal transport plan.
Importantly, we recall the following fundamental result in entropic optimal transport: there exist dual potentials $u, v \in \mathbb{R}^n$ such that the optimal transport plan admits the decomposition

$$\log \mathrm{OT}_\epsilon(K) = u\mathbf{1}^\top + \frac{K}{\epsilon} + \mathbf{1}v^\top, \tag{26}$$

see e.g. Peyré et al. (2019). This characterization allows us to establish the following lemma.

Lemma C.3. For any transport plan $T \in \Pi_n$,

$$\langle T, \log \mathrm{OT}_\epsilon(K) \rangle = \frac{\langle T, K \rangle + \mathcal{W}_\epsilon(K)}{\epsilon}. \tag{27}$$

Proof. For any $T \in \Pi_n$, we have

$$\langle T, \log \mathrm{OT}_\epsilon(K) \rangle = \langle T, u\mathbf{1}^\top \rangle + \langle T, \mathbf{1}v^\top \rangle + \frac{1}{\epsilon}\langle T, K \rangle.$$

Since $T$ is bistochastic, $\langle T, u\mathbf{1}^\top \rangle = \langle \mathbf{1}, u \rangle$ and $\langle T, \mathbf{1}v^\top \rangle = \langle \mathbf{1}, v \rangle$, yielding

$$\langle T, \log \mathrm{OT}_\epsilon(K) \rangle = \langle \mathbf{1}, u + v \rangle + \frac{1}{\epsilon}\langle T, K \rangle.$$

In particular, setting $T = \mathrm{OT}_\epsilon(K)$ yields

$$H(\mathrm{OT}_\epsilon(K)) = \langle \mathbf{1}, u + v \rangle + \frac{1}{\epsilon}\langle \mathrm{OT}_\epsilon(K), K \rangle;$$

substituting this into $\mathcal{W}_\epsilon(K) = -\langle \mathrm{OT}_\epsilon(K), K \rangle + \epsilon H(\mathrm{OT}_\epsilon(K))$ recovers the classical duality identity

$$\langle \mathbf{1}, u + v \rangle = \frac{\mathcal{W}_\epsilon(K)}{\epsilon},$$

which gives the result.

Our main theoretical result follows by combining Lemma C.3 with the envelope theorem, yielding an explicit expression for the gradient of the proposed divergence.

Theorem C.4. For any transport plan $T \in \Pi_n$,

$$\nabla_K \mathrm{KL}(T \,\|\, \mathrm{OT}_\epsilon(K)) = \frac{\mathrm{OT}_\epsilon(K) - T}{\epsilon}. \tag{28}$$

Proof. By definition,

$$\mathrm{KL}(T \,\|\, \mathrm{OT}_\epsilon(K)) = \left\langle T, \log \frac{T}{\mathrm{OT}_\epsilon(K)} \right\rangle = \langle T, \log T \rangle - \langle T, \log \mathrm{OT}_\epsilon(K) \rangle,$$

and only the second term depends on $K$. Differentiating yields

$$\nabla_K \mathrm{KL}(T \,\|\, \mathrm{OT}_\epsilon(K)) = -\nabla_K \langle T, \log \mathrm{OT}_\epsilon(K) \rangle.$$

Applying Lemma C.3 gives

$$\nabla_K \mathrm{KL}(T \,\|\, \mathrm{OT}_\epsilon(K)) = -\frac{1}{\epsilon}\nabla_K \big( \langle T, K \rangle + \mathcal{W}_\epsilon(K) \big).$$

Since $\mathcal{W}_\epsilon(K)$ is defined as the minimum of $\mathcal{W}_\epsilon(T, K)$ over $T \in \Pi_n$, which is a strongly convex problem, the envelope theorem implies

$$\nabla_K \mathcal{W}_\epsilon(K) = \nabla_K \mathcal{W}_\epsilon(T, K)\big|_{T = \mathrm{OT}_\epsilon(K)} = -\mathrm{OT}_\epsilon(K),$$

from which the result follows.
We note that a related derivation is presented in a blog post (Assel, 2024), which draws an insightful connection to the Monge gap regularizer introduced by Uscidda & Cuturi (2023).

C.3. Centered Kernel Alignment (CKA)

Centered Kernel Alignment (CKA) (Cristianini et al., 2001) is a widely used measure of similarity between representation spaces, defined in terms of their associated kernel (or Gram) matrices. Let $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ denote the centering matrix and $\|\cdot\|_F$ the Frobenius norm. Given two kernel matrices $K_1, K_2 \in \mathbb{R}^{n \times n}$, CKA is defined as

$$\mathrm{CKA}(K_1, K_2) = \frac{\langle K_1 H, H K_2 \rangle}{\sqrt{\langle K_1 H, H K_1 \rangle \, \langle K_2 H, H K_2 \rangle}}. \tag{29}$$

For the sake of completeness, we now state a few classical results regarding CKA.

Kernel centering. The matrix $H$ centers the data in feature space. We define the centered kernel as

$$\bar{K} = H K H. \tag{30}$$

This operation corresponds to centering the underlying representations before computing pairwise similarities. Indeed, in the linear case where $K = XX^\top$ for a data matrix $X \in \mathbb{R}^{n \times d}$, we have

$$\bar{K} = \bar{X}\bar{X}^\top, \tag{31}$$

where $\bar{X}$ denotes the centered features, $\bar{X}_i = X_i - \frac{1}{n}\sum_{j=1}^{n} X_j$, i.e. $\bar{X} = HX$.

CKA as a cosine affinity. A well-known property of CKA is that it can be interpreted as a cosine similarity between centered kernels, viewed as vectors in $\mathbb{R}^{n^2}$.

Proposition C.5. Let $\bar{K}_1 = H K_1 H$ and $\bar{K}_2 = H K_2 H$. Then CKA can be written as

$$\mathrm{CKA}(K_1, K_2) = k\big(\mathrm{vec}(\bar{K}_1), \mathrm{vec}(\bar{K}_2)\big), \tag{32}$$

where $k(\cdot, \cdot)$ denotes the cosine affinity and $\mathrm{vec}(\cdot)$ denotes matrix vectorization.

Proof. Recall that $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is symmetric and idempotent, i.e. $H^\top = H$ and $H^2 = H$. We compute

$$\langle H K_1 H, H K_2 H \rangle = \mathrm{tr}(H K_1 H H K_2 H) = \mathrm{tr}(H K_1 H K_2 H) = \mathrm{tr}(K_1 H K_2 H) = \langle K_1 H, H K_2 \rangle,$$

where we used cyclic invariance of the trace and the idempotence of $H$.
In particular, setting $K_1 = K_2 = K$ yields

$$\langle H K H, H K H \rangle = \|H K H\|_F^2. \tag{33}$$

Combining these identities proves that CKA is exactly the cosine similarity between the vectorized centered kernels.

Computational complexity. We conclude this section with the computational complexity of CKA for linear kernels, as considered in this work.

Proposition C.6. Assume that $K_1 = X_1 X_1^\top$ with $X_1 \in \mathbb{R}^{n \times d_1}$, $K_2 = X_2 X_2^\top$ with $X_2 \in \mathbb{R}^{n \times d_2}$, and denote $d = \max(d_1, d_2)$. Then the memory complexity of computing $\mathrm{CKA}(K_1, K_2)$ is $O(nd + d^2)$.

Proof. Assume that $X_1$ and $X_2$ are centered, which can be done in $O(nd)$ time and memory. Using the identities established above, we have

$$\langle K_1 H, H K_2 \rangle = \langle X_1 X_1^\top, X_2 X_2^\top \rangle = \langle X_1^\top X_2, X_1^\top X_2 \rangle = \|X_1^\top X_2\|_F^2.$$

Thus, computing the numerator only requires storing the $d_1 \times d_2$ matrix $X_1^\top X_2$. Similarly,

$$\|K_1 H\|_F^2 = \|X_1^\top X_1\|_F^2, \qquad \|K_2 H\|_F^2 = \|X_2^\top X_2\|_F^2,$$

which require storing only the $d_1 \times d_1$ and $d_2 \times d_2$ Gram matrices, respectively. Therefore, the overall memory complexity is dominated by storing $X_1$, $X_2$ and the associated Gram matrices, yielding $O(nd + d^2)$, as claimed.
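The identity in the proof translates directly into code. Below is a minimal numpy sketch (with hypothetical helper names) that computes linear-kernel CKA without ever materializing the $n \times n$ Gram matrices, checked against a naive implementation of Eq. (29):

```python
import numpy as np

def linear_cka(X1, X2):
    """Linear-kernel CKA in O(nd + d^2) memory: uses the identity
    <K1 H, H K2> = ||X1c^T X2c||_F^2 (Prop. C.6), so only d x d matrices
    are ever formed from the centered features."""
    X1c = X1 - X1.mean(axis=0)             # center features (Xbar = H X)
    X2c = X2 - X2.mean(axis=0)
    num = np.linalg.norm(X1c.T @ X2c, "fro") ** 2
    den = np.linalg.norm(X1c.T @ X1c, "fro") * np.linalg.norm(X2c.T @ X2c, "fro")
    return float(num / den)

def naive_cka(X1, X2):
    """Reference implementation straight from Eq. (29), building n x n kernels."""
    n = X1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K1, K2 = X1 @ X1.T, X2 @ X2.T
    dot = lambda A, B: np.trace(A @ H @ B @ H)   # <A H, H B> via the trace
    return float(dot(K1, K2) / np.sqrt(dot(K1, K1) * dot(K2, K2)))

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 8))
X2 = rng.normal(size=(100, 5))
print(linear_cka(X1, X2), naive_cka(X1, X2))     # the two values agree
```

For $n$ in the millions and $d$ of a few thousand, as in the alignment setting of this paper, the feature-space formulation is the only practical option, since the kernel-space version would require storing an $n \times n$ matrix.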