Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation



Marceau Lafargue-Hauret, Raghav Mehta, Fabio De Sousa Ribeiro, Mélanie Roschewitz, Ben Glocker
Department of Computing, Imperial College London, UK

ABSTRACT

Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High-Resolution Overlay map (CHRO-map), is also introduced. Experiments show that annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving ∼94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.

Index Terms: Contrastive learning, Image segmentation, Counterfactual generation, Visualization, Chest X-Ray

1. INTRODUCTION

Contrastive learning [1, 2, 3, 4] has become a popular solution to exploit unlabeled data. It aims to learn representations by bringing similar samples (positives) closer in the embedding space while pushing dissimilar samples (negatives) apart.
Typically, this is achieved by defining pairs or sets of augmented views of the same image as positives, and different images as negatives, and training the model with a contrastive loss such as the InfoNCE or NT-Xent objective [1, 3]. This framework encourages the encoder to extract features that are invariant to augmentations and discriminative across samples, enabling effective representation learning without labels.

Classical contrastive methods, such as SimCLR [3] or MoCo [4], operate at the image level, producing a single embedding per image. While this is effective for classification or retrieval tasks, it limits their ability to capture the spatially localized information required for dense prediction tasks such as segmentation. To address this limitation, dense contrastive learning [5, 6, 7] extends the contrastive objective to the pixel or patch level. Instead of comparing global image embeddings, pixel-level embeddings are contrasted across multiple augmented views of the same image, enabling spatially consistent and semantically rich representations that can be leveraged for segmentation or detection tasks.

Contrastive learning has also been used in supervised or semi-supervised settings [1, 8]. However, the standard data augmentations used in contrastive learning methods (such as rotation, cropping, and blurring) do not capture the complex and challenging dataset shifts that arise in real-world applications. This limitation can be overcome by counterfactual generation, which seeks to produce realistic and semantically meaningful variations in the data. Counterfactual generation relies on the framework of Structural Causal Models (SCMs), which describe a system as a set of causal variables connected by deterministic functions and exogenous noise terms.
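The SCM setup just described can be illustrated with a toy example. This is a minimal sketch of counterfactual inference in a hand-built two-variable SCM, not the paper's deep model; the variable names, mechanism `f_intensity`, and coefficient 0.5 are illustrative assumptions.

```python
import numpy as np

# Toy SCM (illustrative only): a binary "disease" variable D and an
# "intensity" variable I := f(D, u_I), where u_I is the exogenous noise term.

def f_intensity(d, u_i):
    """Causal mechanism: intensity depends on disease status plus individual noise."""
    return 0.5 * d + u_i

rng = np.random.default_rng(0)
u_i = rng.normal(0.0, 0.1)          # exogenous noise, fixed for this individual
d_observed = 1                       # factual world: disease present
i_factual = f_intensity(d_observed, u_i)

# Counterfactual: intervene do(D := 0) while REUSING the same exogenous noise,
# i.e. "what would this individual's intensity have been without the disease?"
i_counterfactual = f_intensity(0, u_i)

print(i_factual - i_counterfactual)  # ~0.5: only the intervened effect changes
```

The key point the sketch makes is that the exogenous noise is held fixed across the factual and counterfactual worlds, so only the intervened variable's causal effect differs; DSCMs follow the same logic with neural-network mechanisms.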
By intervening on a specific variable (e.g., changing the presence of pleural effusion in a chest X-ray) while keeping all other factors fixed, one can generate a counterfactual version of the data that reflects what would have been observed had that change occurred. Deep SCMs (DSCMs) [9, 10] extend this idea by parameterizing causal mechanisms with neural networks, enabling counterfactual image synthesis directly in high-dimensional spaces. The development of counterfactual generation [9, 11, 10] has found use in improving model robustness [12, 13] as well as in contrastive learning [14] for image classification models.

In this work, we leverage counterfactual generation to enhance contrastive learning for dense contrastive objectives, marking the first exploration of this setting. We focus on lung segmentation from chest X-rays in the presence of Pleural Effusion (PE). PE renders a large section of the lungs opaque, often leading to undersegmentation of the lungs [13]. Our contributions can be summarized as follows:

• Four novel dense contrastive learning methods: Dual-View (DVD-CL) and Multi-View (MVD-CL), plus their silver-standard label supervised variants.
• Integration of counterfactual augmentation for invariant representation learning.
• CHRO-map for visualising pixel-level embeddings.

These innovations enhance the interpretability and robustness of medical image segmentation using unlabeled data and may provide components for constructing foundation models.

Fig. 1: Overview of the proposed methods. Views are formed through scanner (SC) and pleural effusion (PE) counterfactuals, in combination with a traditional augmentation pipeline. (S-)DVD-CL computes three dense similarity computations between the anchor view and each of the target views, and averages the results. (S-)MVD-CL computes similarity between all views at once.

2. METHODS

In contrastive learning, data augmentations play a crucial role in making models robust to the dataset shifts that arise in real-world applications. However, such shifts are often complex and difficult to capture using standard augmentations, such as color jittering or random cropping. To address this limitation, counterfactual generation aims to simulate realistic and semantically meaningful variations in the data.

2.1. Counterfactual Generation

We train a counterfactual generative Hierarchical Variational AutoEncoder (HVAE) [10, 15] using the same method as the original authors. We use sex, scanner type, and the presence of pleural effusion (PE) as parents in the causal graph of the SCM for chest X-rays. We generate three counterfactuals: (i) only the scanner is changed, (ii) only the presence of PE is changed, and (iii) both the scanner and the presence of PE are changed. Each of these counterfactuals, as well as the base image, is further augmented with random rotation, cropping, and image effects (color jittering, Gaussian blurring, and solarization).

2.2. Dense Counterfactual Contrastive Learning

We propose four different dense contrastive learning methods that leverage image counterfactuals for data augmentation (see Fig. 1). They consist of dual- and multi-view approaches, each available in both supervised and unsupervised variants.

Dual-View Dense Contrastive Learning (DVD-CL) extends the pixel-level contrastive framework of VADeR [5] to support multiple views. In VADeR, two augmented views of an image are used: pixels at the same spatial location form positive pairs, while all other pixel combinations serve as negatives. In DVD-CL, the non-counterfactual view serves as the anchor view. The remaining views, counterfactual images generated by the HVAE, serve as target views.
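A pixel-level NT-Xent loss between an anchor view and one target view can be sketched in NumPy as follows. This is a hedged sketch, not the paper's implementation: the function name `dense_nt_xent`, the temperature value, and the array shapes are our own illustrative choices. Rows of the two arrays stand for embeddings at the same sampled spatial locations, so row i of the anchor and row i of the target form the positive pair and all other pairs are negatives.

```python
import numpy as np

def dense_nt_xent(anchor, target, temperature=0.1):
    """Symmetrised NT-Xent over pixel embeddings from two views.

    anchor, target: (N, D) arrays of L2-normalised embeddings for N sampled
    pixel locations; matching rows form the positive pairs.
    """
    logits = anchor @ target.T / temperature          # (N, N) pairwise similarities
    # Cross-entropy where the "correct class" for pixel i is pixel i in the other view.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_a2t = -np.mean(np.diag(log_prob))
    # Average over both directions: anchor as reference, then target as reference.
    logits_t = logits.T
    log_prob_t = logits_t - np.log(np.exp(logits_t).sum(axis=1, keepdims=True))
    loss_t2a = -np.mean(np.diag(log_prob_t))
    return 0.5 * (loss_a2t + loss_t2a)

# Perfectly aligned views score lower than views with shuffled correspondences.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
z /= np.linalg.norm(z, axis=1, keepdims=True)
print(dense_nt_xent(z, z) < dense_nt_xent(z, np.roll(z, 1, axis=0)))  # True
```

In DVD-CL this two-view loss would be evaluated once per target view and the results averaged; MVD-CL instead contrasts all views' pixels jointly.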
For each target view, positive pixel pairs are defined as those sharing the same spatial location in the anchor and target view, while all other pixel combinations are negatives. The NT-Xent loss is averaged over both directions, first treating the anchor as the reference, then the target. The overall contrastive loss is calculated by averaging across all target views (three in our experiments, although any number of views is supported).

Multi-View Dense Contrastive Learning (MVD-CL) considers all views simultaneously and forms pixel pairs following the same procedure as in DVD-CL, but extended across all available views. The contrastive loss is then computed over all pairwise pixel similarities, enabling joint optimization across the entire multi-view set.

Supervised DVD-CL (S-DVD-CL) draws inspiration from prior pixel-level contrastive learning methods that leverage annotations to define positive and negative pairs [8, 16]. In this setting, pixels belonging to the same class (e.g., left lung) are treated as positives, while pixels from different classes (e.g., right lung) are treated as negatives. Using this supervision-based pairing strategy, we compute the contrastive loss in the same manner as in DVD-CL, evaluating the loss over two views at a time and averaging the results across all view pairs. To ensure meaningful supervision, only non-background pixels are used as anchors.

Supervised MVD-CL (S-MVD-CL) again considers all views simultaneously, forming pixel pairs from annotations in the same manner as in S-DVD-CL. The contrastive loss is computed over all pixel similarities and averaged across the full set of views.

2.3. CHRO-map

We introduce a Color-coded High-Resolution Overlay map (CHRO-map) to visualize the network's output embeddings. First, the high-dimensional embeddings are projected to a two-dimensional space using UMAP [17]. We then compute the minimum-volume enclosing ellipse around the projected embeddings and determine the affine transformation that maps this ellipse to the unit circle. This transformation is applied to all embeddings, which are subsequently assigned HSV colors based on their polar coordinates, using the angle θ as the hue and the radius r as the value. Finally, each color is overlaid on its corresponding pixel in the image to produce the visualization. CHRO-map provides an interpretable, high-resolution 2D visualization of dense feature embeddings, where semantic proximity in the latent space corresponds to color similarity in the visualization. This enables qualitative assessment of representation clustering and counterfactual invariance (Fig. 2).

Fig. 2: Output CHRO-maps of the four different methods: (a) DVD-CL, (b) MVD-CL, (c) S-DVD-CL, (d) S-MVD-CL. Similar colors indicate similar encodings. We observe that MVD-CL fails to capture meaningful representations, whereas DVD-CL effectively encodes pixels based on their relative position to the spine, which can be further fine-tuned for segmentation. S-DVD-CL and S-MVD-CL manage to sharply distinguish both lungs.

3. IMPLEMENTATION DETAILS

Dataset. We utilize the publicly available PadChest [18] dataset for experimentation (∼60k training and ∼17k validation images). We use silver-standard CheXmask [19] annotations for the supervised contrastive learning variants. The goal is to generate (left and right) lung segmentations from chest X-ray images of healthy patients and patients with Pleural Effusion (PE). Furthermore, we utilize 70 manually annotated images (20 healthy and 50 PE) for fine-tuning and validating segmentation models [13].

Network. Instead of pre-training only the encoder of a U-Net model, as is done in other contrastive learning works [7, 6], we pre-train the full encoder-decoder structure of a standard U-Net model with a ResNet50 encoder.
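The colour-assignment step of the CHRO-map (Sec. 2.3) can be sketched as follows. This is a simplified stand-in, not the paper's code: we assume the embeddings have already been projected to 2-D (e.g., by UMAP), and we whiten with the covariance ellipse rather than the exact minimum-volume enclosing ellipse; the function name `chro_colors` is our own.

```python
import colorsys
import numpy as np

def chro_colors(points_2d):
    """Assign HSV-derived RGB colors to 2-D projected embeddings, CHRO-map style.

    Simplification: the covariance ellipse approximates the minimum-volume
    enclosing ellipse used in the paper.
    """
    centered = points_2d - points_2d.mean(axis=0)
    cov = np.cov(centered.T)
    # Affine map sending the covariance ellipse to a circle (whitening).
    vals, vecs = np.linalg.eigh(cov)
    transform = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    circ = centered @ transform.T
    circ /= np.linalg.norm(circ, axis=1).max()   # scale to fit the unit circle
    # Polar coordinates -> HSV: angle as hue, radius as value.
    theta = np.arctan2(circ[:, 1], circ[:, 0])
    r = np.linalg.norm(circ, axis=1)
    hue = (theta + np.pi) / (2.0 * np.pi)
    return np.array([colorsys.hsv_to_rgb(h, 1.0, v) for h, v in zip(hue, r)])

# Stand-in for UMAP-projected embeddings: an anisotropic 2-D point cloud.
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])
rgb = chro_colors(pts)                            # one RGB triple per pixel embedding
```

Each RGB triple would then be overlaid on its corresponding pixel; nearby embeddings get similar hues, which is what makes cluster structure visible in the overlay.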
Precise Pixel Tracking. We model each geometric transformation (e.g., rotation, cropping, scaling) using a homogeneous transformation matrix. This enables us to maintain exact pixel-level correspondences across augmented views, allowing flexible formation of pixel-level positive pairs under any affine transformation.

Sampling. As computing the contrastive losses over all pixels is computationally infeasible, we randomly sample 1,000 pixels from each view. Sampling is restricted to the intersection of active views, specifically the anchor and target views for (S-)DVD-CL, and all views for (S-)MVD-CL. This sampling strategy is particularly suitable when the objects to be segmented are located near the center of the image, as pixels closer to the borders are less likely to be observed during pre-training.

Table 1: Evaluation of the learned embeddings using the proposed contrastive learning methods before fine-tuning.

Method    | d/σ (↑) | K-means purity (↑)
DVD-CL    | 0.83    | 0.6979
MVD-CL    | 1.39    | 0.6391
S-DVD-CL  | 17.07   | 0.9581
S-MVD-CL  | 12.16   | 0.9705

4. RESULTS

4.1. Latent Space Analysis

We evaluate the embeddings learned by the proposed methods both qualitatively (with CHRO-maps) and quantitatively. For the quantitative results, the output embeddings are clustered into three clusters using K-means, and the purity of each cluster is calculated using the CheXmask labels. Purity is the ratio of majority-labeled pixels in a cluster over the total number of pixels in the cluster [20]; a higher purity value represents better clustering. In addition, utilizing the same CheXmask labels, we calculate the ratio of inter-class distance to intra-class standard deviation, which should increase as the model clusters pixels from the same class together [21].

From both Fig. 2 and Table 1, we observe that, among the unsupervised variants, DVD-CL shows no sharp clusters; however, each lung occupies a distinct region of the latent space. The CHRO-maps further reveal that the model encodes pixels according to their spatial relationship to the spine, indicating a degree of positional awareness in the learned representations. In contrast, MVD-CL fails to learn meaningful semantic structure, as evidenced by its poor clustering performance in Table 1. We also observe that the supervised variants (S-DVD-CL and S-MVD-CL) learn well-defined clusters corresponding to the three anatomical classes. In the CHRO-maps (Fig. 2), the boundaries between the lungs and the background are also assigned distinct colors, suggesting that the model captures inter-class boundary information in its latent representations. In Table 1, we observe that both supervised variants achieve high clustering purity and distance-to-variance ratio.

Table 2: Performance (mean ± std) of the various contrastive pre-training methods after fine-tuning across five folds, using Dice Scores (%), 95% Hausdorff Distance (HD95), and Average Surface Distance (ASD). For Dice Scores, we report the overall performance (DSC), performance on healthy patients (DSC_NF), and performance on patients with Pleural Effusion (DSC_PE). Overall best results are highlighted in bold. Best results for both unsupervised and supervised methods are also highlighted.

Pre-training Method | DSC (↑)      | DSC_NF (↑)   | DSC_PE (↑)   | HD95 (↓)   | ASD (↓)
None                | 87.05 ± 4.34 | 93.59 ± 1.28 | 84.76 ± 1.28 | 21.9 ± 8.0 | 5.39 ± 2.05
Unsupervised:
SimCLR [3]          | 90.90 ± 2.94 | 96.71 ± 0.28 | 88.61 ± 4.15 | 11.9 ± 5.3 | 3.64 ± 1.63
VADeR [5]           | 92.26 ± 2.66 | 97.22 ± 0.16 | 90.33 ± 3.68 | 10.4 ± 3.7 | 2.79 ± 1.14
DVD-CL (ours)       | 92.76 ± 2.23 | 97.36 ± 0.14 | 91.08 ± 3.18 | 10.5 ± 3.6 | 2.86 ± 1.04
MVD-CL (ours)       | 92.27 ± 2.73 | 97.18 ± 0.25 | 90.36 ± 3.79 | 9.1 ± 4.9  | 2.61 ± 1.16
Supervised:
CheXMask            | 93.65 ± 2.33 | 97.81 ± 0.20 | 92.02 ± 3.23 | 7.3 ± 1.8  | 2.46 ± 1.26
SSDCL [8]           | 92.91 ± 2.77 | 97.58 ± 0.17 | 91.10 ± 3.84 | 15.3 ± 5.0 | 3.31 ± 1.20
S-DVD-CL (ours)     | 93.15 ± 2.61 | 97.62 ± 0.11 | 91.43 ± 3.65 | 8.0 ± 3.9  | 2.96 ± 1.97
S-MVD-CL (ours)     | 93.93 ± 1.92 | 97.75 ± 0.09 | 92.43 ± 2.67 | 7.3 ± 3.4  | 2.05 ± 0.74

Fig. 3: Qualitative results for lung segmentation using various pre-training methods on a PE image. Supervised methods (top row) show better performance compared to unsupervised methods (bottom row). We also observe that the proposed methods outperform their counterpart baselines.

4.2. Fine-tuning Evaluation

We fine-tune all pre-trained models on the manually annotated PadChest subset [13] using a five-fold cross-validation approach. As shown in Table 2 and Fig. 3, all proposed pre-training methods outperform their respective baselines, indicating that pixel-level counterfactual contrastive pre-training produces more transferable representations for segmentation. Among the unsupervised methods, DVD-CL achieves the best overall performance, outperforming SimCLR and VADeR. The use of counterfactual augmentations further improves generalisation, as reflected by the lower variance across folds.
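The homogeneous-matrix pixel tracking described in Sec. 3 can be sketched as follows. A hedged sketch under our own assumptions: the helper names (`rotation`, `crop`, `warp_pixels`) and the particular composition order are illustrative, not the paper's code; the point is that chaining 3x3 matrices gives exact pixel correspondences between any two augmented views.

```python
import numpy as np

def rotation(deg):
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def crop(x0, y0):
    # Cropping at offset (x0, y0) is a translation of the coordinate origin.
    return np.array([[1.0, 0.0, -x0], [0.0, 1.0, -y0], [0.0, 0.0, 1.0]])

def warp_pixels(coords, matrix):
    """Map (N, 2) pixel coordinates through a 3x3 homogeneous matrix."""
    homo = np.hstack([coords, np.ones((coords.shape[0], 1))])
    out = homo @ matrix.T
    return out[:, :2] / out[:, 2:3]

# Compose the geometric augmentations applied to each view into one matrix.
view_a = crop(10, 5) @ rotation(30)
view_b = crop(2, 8) @ rotation(-15)

# A pixel in view A corresponds to the pixel mapped by M = T_b @ inv(T_a):
# undo view A's augmentation, then apply view B's.
a_to_b = view_b @ np.linalg.inv(view_a)

p = np.array([[12.0, 34.0]])
p_in_b = warp_pixels(warp_pixels(p, np.linalg.inv(view_a)), view_b)
assert np.allclose(warp_pixels(p, a_to_b), p_in_b)  # one matrix == chained maps
```

Because the correspondence map is a single invertible matrix, positive pairs can be formed between any two views and restricted to the coordinate region where all active views overlap, which is what the sampling strategy above relies on.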
Both supervised variants, S-DVD-CL and S-MVD-CL, further improve performance, demonstrating the benefit of incorporating label information when available. Notably, S-MVD-CL surpasses the silver-standard CheXMask pre-trained model, achieving an average DSC of ∼94% on challenging manually annotated samples. This highlights that dense, multi-view contrastive learning with pixel-level supervision can yield representations that transfer well to difficult samples, even from silver-standard pre-training data.

5. CONCLUSION

In this work, we introduced a family of pixel-level counterfactual contrastive learning methods designed for medical image segmentation. By combining counterfactual image generation with dense contrastive objectives, our approach enables the learning of representations that are both spatially consistent and invariant to confounding imaging factors such as scanner type or disease presence. Through extensive experiments, we demonstrated that our unsupervised methods outperform existing self-supervised baselines, while the supervised variants further improve performance and robustness. Notably, both S-DVD-CL and S-MVD-CL outperform supervised segmentation pre-training when given access to identical data. The proposed CHRO-map visualisation also provides an intuitive way to interpret pixel embeddings and assess representation quality. Overall, our results demonstrate that integrating counterfactual reasoning into dense contrastive learning presents a promising approach to developing more interpretable and robust medical segmentation models. However, our counterfactual generation approach assumes that the causal graph is known and that counterfactuals are identifiable from observed data. These assumptions may not hold in practice, affecting the causal validity of our estimates [22].
Future work may explore alternative counterfactual generation methods to address these limitations, as well as scaling the framework to 3D modalities or semi-supervised settings.

6. COMPLIANCE WITH ETHICAL STANDARDS

This study uses secondary, fully anonymised data which is publicly available and is exempt from ethical approval.

7. ACKNOWLEDGMENTS

This project was partially supported by the Royal Academy of Engineering (Kheiron/RAEng Research Chair), the UKRI AI programme, and the EPSRC CHAI - Causality in Healthcare AI Hub (grant no. EP/Y028856/1), and the European Union's Horizon Europe research and innovation programme under grant agreement 101080302.

8. REFERENCES

[1] K. Sohn, "Improved Deep Metric Learning with Multi-class N-pair Loss Objective," in NeurIPS, 2016, vol. 29, pp. 1857–1865.
[2] A. van den Oord, Y. Li, and O. Vinyals, "Representation Learning with Contrastive Predictive Coding," Jan. 2019.
[3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A Simple Framework for Contrastive Learning of Visual Representations," in ICML, 2020, pp. 1597–1607, PMLR.
[4] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum Contrast for Unsupervised Visual Representation Learning," in CVPR, 2020, pp. 9726–9735.
[5] P. O. O. Pinheiro, A. Almahairi, R. Benmalek, F. Golemo, and A. Courville, "Unsupervised Learning of Dense Visual Representations," in NeurIPS, 2020, vol. 33, pp. 4489–4500, Curran Associates, Inc.
[6] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, "Dense Contrastive Learning for Self-Supervised Visual Pre-Training," in CVPR, 2021, pp. 3023–3032.
[7] Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, "Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning," in CVPR, 2021, pp. 16679–16688.
[8] X. Hu, D. Zeng, X. Xu, and Y. Shi, "Semi-supervised Contrastive Learning for Label-Efficient Medical Image Segmentation," in MICCAI, 2021, pp. 481–490.
[9] N. Pawlowski, D. C. Castro, and B. Glocker, "Deep structural causal models for tractable counterfactual inference," NeurIPS, vol. 33, pp. 857–869, 2020.
[10] F. D. S. Ribeiro, T. Xia, et al., "High Fidelity Image Counterfactuals with Probabilistic Causal Models," in ICML, 2023, pp. 7390–7425, PMLR.
[11] M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath, "CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training," in ICLR, 2018.
[12] B. Chen, Y. Zhu, Y. Ao, S. Caprara, et al., "Generalizable Single-Source Cross-Modality Medical Image Segmentation via Invariant Causal Mechanisms," in WACV, 2025, pp. 3592–3602.
[13] R. Mehta, F. D. S. Ribeiro, T. Xia, M. Roschewitz, A. Santhirasekaram, D. C. Marshall, and B. Glocker, "CF-Seg: Counterfactuals meet segmentation," in MICCAI. Springer, 2025, pp. 117–127.
[14] M. Roschewitz, F. De Sousa Ribeiro, T. Xia, G. Khara, and B. Glocker, "Robust image representations with counterfactual contrastive learning," Medical Image Analysis, p. 103668, 2025.
[15] M. Monteiro, F. D. S. Ribeiro, N. Pawlowski, D. C. Castro, and B. Glocker, "Measuring Axiomatic Soundness of Counterfactual Image Models," in ICLR, 2023.
[16] N. Dey, B. Billot, H. E. Wong, C. Wang, M. Ren, E. Grant, A. V. Dalca, and P. Golland, "Learning General-purpose Biomedical Volume Representations using Randomized Synthesis," in ICLR, 2024.
[17] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction," Sept. 2020.
[18] A. Bustos, A. Pertusa, J.-M. Salinas, and M. de la Iglesia-Vayá, "PadChest: A large chest x-ray image dataset with multi-label annotated reports," Medical Image Analysis, vol. 66, pp. 101797, 2020.
[19] N. Gaggion, C. Mosquera, L. Mansilla, J. M. Saidman, M. Aineseder, D. H. Milone, and E. Ferrante, "CheXmask: A large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images," Scientific Data, vol. 11, no. 1, pp. 511, 2024.
[20] S. Chen, P. Gopalakrishnan, et al., "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in Proc. DARPA Broadcast News Transcription and Understanding Workshop, Virginia, USA, 1998, vol. 8, pp. 127–132.
[21] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics - Theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.
[22] F. De Sousa Ribeiro, A. Santhirasekaram, and B. Glocker, "Counterfactual identifiability via dynamic optimal transport," 2025.
