Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI
White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are imaging features associated with cerebral small vessel disease (SVD) that are visible on brain magnetic resonance imaging (MRI) scans. The development and validation of deep learning models to segment and differentiate these features is difficult because they visually confound each other in the fluid-attenuated inversion recovery (FLAIR) sequence and often appear in the same subject. We investigated six strategies for training a combined WMH and ISL segmentation model using partially labelled data. We combined privately held fully and partially labelled datasets with publicly available partially labelled datasets to yield a total of 2052 MRI volumes, with 1341 and 1152 containing ground truth annotations for WMH and ISL respectively. We found that several methods were able to effectively leverage the partially labelled data to improve model performance, with the use of pseudolabels yielding the best result.
💡 Research Summary
The paper addresses a common challenge in medical image analysis: how to train a deep‑learning model to segment two clinically important but visually confounding lesion types—white‑matter hyperintensities (WMH) and ischaemic stroke lesions (ISL)—when only partially labelled data are available. Both lesion types appear as bright regions on fluid‑attenuated inversion recovery (FLAIR) MRI, and they often coexist in the same subject, making manual delineation labour‑intensive and limiting the size of fully annotated datasets.
To investigate practical solutions, the authors assembled a large heterogeneous collection of 2052 FLAIR volumes from twelve sources, including three proprietary stroke cohorts (MSS1‑3), two longitudinal ageing studies (LBC1936, LBC1921), a public WMH challenge dataset (WMH‑ch), several public stroke datasets (ISLES, SOOP, WSS, ESS), a brain‑tumour dataset (BraTS) used only as a negative control, and a small intracerebral‑haemorrhage cohort (LINCHPIN). Across these datasets, 1341 volumes contain ground‑truth WMH masks and 1152 contain ISL masks, but only a subset (the “fully labelled subset”, FLS) has both masks simultaneously. The remaining scans constitute the “partially labelled subset” (PLS), where each image may have a WMH mask, an ISL mask, or neither.
All scans were pre‑processed uniformly: N4 bias‑field correction, resampling to 1 mm isotropic voxels, brain extraction with SynthStrip, cropping/padding to a 160³ voxel volume, and z‑score normalisation. When lesion masks were originally drawn on a different sequence (e.g., DWI for acute ISL), the authors rigidly registered those masks to the FLAIR image and discarded cases where the lesion was not clearly visible on FLAIR.
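Two of the preprocessing steps above are simple enough to sketch directly (N4 correction and SynthStrip are external tools). The following is a minimal illustration, not the authors' code; function names and the in-mask normalisation choice are assumptions.

```python
import numpy as np

def zscore_normalise(vol, brain_mask=None):
    """Z-score normalise a volume; statistics are taken inside the brain
    mask when one is supplied (an assumption, not stated in the paper)."""
    ref = vol[brain_mask > 0] if brain_mask is not None else vol
    return (vol - ref.mean()) / (ref.std() + 1e-8)

def crop_or_pad(vol, target=(160, 160, 160)):
    """Centre-crop or zero-pad each axis to the 160^3 target shape."""
    out = np.zeros(target, dtype=vol.dtype)
    src, dst = [], []
    for s, t in zip(vol.shape, target):
        if s >= t:                      # axis too long: take the central t voxels
            start = (s - t) // 2
            src.append(slice(start, start + t))
            dst.append(slice(0, t))
        else:                           # axis too short: place centrally in zeros
            start = (t - s) // 2
            src.append(slice(0, s))
            dst.append(slice(start, start + s))
    out[tuple(dst)] = vol[tuple(src)]
    return out
```

In practice both steps would run after brain extraction, so that background air does not skew the intensity statistics.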
The core of the study is a systematic comparison of six training strategies that exploit the partially labelled data in different ways, evaluated alongside a fully supervised multiclass baseline, all built on a 3‑D U‑Net backbone:
- Multiclass baseline – Trains a standard three‑class U‑Net (background, WMH, ISL) using only the fully labelled scans. No partially labelled data are used.
- Multi‑model – Trains two independent binary U‑Nets, one for WMH and one for ISL, each using the corresponding partially labelled scans. The two predictions are later merged.
- Class‑conditional model – Shares a common encoder but has two separate decoder heads (one per class). For a given training image, the loss is computed only for the head(s) for which a label exists; if both labels are present the image is passed through twice and the losses are averaged.
- Pseudolabel (self‑training) model – First trains a multiclass U‑Net on the fully labelled data, then generates “pseudo‑labels” for the missing class on each partially labelled scan. The pseudo‑labels are treated as ground truth and the model is re‑trained on the entire dataset (FLS + PLS).
- Marginal loss model – Merges any missing class with the background label and computes a standard cross‑entropy loss over the reduced label set. This avoids penalising predictions for absent classes but conflates them with true background.
- Class‑adaptive loss model – Computes the loss only over the classes that are actually annotated in the current image, effectively ignoring missing labels without altering the label space.
- Two‑phase (phased) model – First pre‑trains on all partially labelled scans with WMH and ISL merged into a single “lesion” class. Then the final layer is replaced with a three‑class head and the network is fine‑tuned on the fully labelled subset, using a weighted sampling scheme to balance classes.
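The two loss-based strategies differ only in how absent labels enter the cross-entropy. A minimal numpy sketch of one plausible reading of each, over per-voxel softmax outputs (function names, the renormalisation step, and the `(C, N)` layout are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def marginal_ce(probs, label, present):
    """Marginal cross-entropy: probability mass of absent foreground
    classes is folded into background before taking the log.
    probs: (C, N) softmax outputs; label: (N,) ints; present: annotated classes."""
    C, N = probs.shape
    merged = probs.copy()
    for c in range(1, C):
        if c not in present:
            merged[0] += merged[c]      # conflate absent class with background
            merged[c] = 0.0
    return -np.log(merged[label, np.arange(N)] + 1e-8).mean()

def class_adaptive_ce(probs, label, present):
    """Class-adaptive cross-entropy: the loss is computed only over the
    annotated classes (plus background); absent channels are ignored."""
    keep = [0] + sorted(present)
    sub = probs[keep]                                    # restrict channels
    sub = sub / (sub.sum(axis=0, keepdims=True) + 1e-8)  # renormalise
    remap = {c: i for i, c in enumerate(keep)}
    idx = np.array([remap[int(l)] for l in label])
    return -np.log(sub[idx, np.arange(label.size)] + 1e-8).mean()
```

The contrast is visible in how a confident prediction for a missing class is treated: the marginal loss rewards it only insofar as it counts as background, while the class-adaptive loss leaves it unpenalised.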
All models were trained with identical optimisation settings (Adam, lr = 1e‑4, batch size = 2, 200 epochs, early stopping on validation Dice). Evaluation was performed on a held‑out test set comprising more than 1000 volumes, using Dice similarity coefficient, Hausdorff distance and average surface distance for each lesion type separately.
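The Dice similarity coefficient, computed separately per lesion class, reduces to overlap between binary masks; a minimal sketch (the epsilon smoothing term is a common convention, assumed here):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary masks:
    2|P ∩ G| / (|P| + |G|). eps avoids 0/0 when both masks are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```

For a combined WMH/ISL model this would be called twice per volume, once on each class's binarised prediction against the corresponding ground-truth mask.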
Results
The pseudolabel strategy achieved the highest overall performance (mean Dice ≈ 0.84 for WMH and ≈ 0.78 for ISL), outperforming the multiclass baseline by roughly 3–4 absolute Dice points. The class‑conditional and two‑phase models also showed solid gains over the baseline, while remaining computationally efficient. The multi‑model approach was the simplest but lagged behind because it could not learn inter‑lesion spatial relationships. Marginal loss and class‑adaptive loss both improved over the baseline, confirming that even naïve handling of missing labels can be beneficial, but they fell short of the self‑training gains.
Qualitative inspection revealed that the pseudolabel model reduced false positives in regions where WMH and ISL overlap, and it was more robust to domain shifts across the twelve datasets (different scanners, field strengths, and acquisition protocols). The authors note that the quality of the pseudo‑labels depends on the initial fully‑labelled model; therefore, confidence‑thresholding or ensembling could further stabilise the approach.
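The confidence-thresholding the authors suggest could take a simple form: keep a pseudo-label only where the winning softmax probability is high, and mark the rest as ignore voxels excluded from the retraining loss. A hedged sketch (the threshold value and ignore-index convention are assumptions):

```python
import numpy as np

def confident_pseudolabels(probs, threshold=0.9, ignore_index=255):
    """Turn per-voxel softmax outputs into pseudo-labels, keeping only
    voxels where the winning class exceeds a confidence threshold.
    probs: (C, ...) array. Low-confidence voxels receive ignore_index
    so a masked loss can skip them during retraining."""
    conf = probs.max(axis=0)
    labels = probs.argmax(axis=0).astype(np.int64)
    labels[conf < threshold] = ignore_index
    return labels
```

Ensembling would slot in just before this step, by averaging the softmax outputs of several initial models before thresholding.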
Discussion and Implications
The study demonstrates that partially labelled data—far more abundant than fully annotated scans—can be leveraged effectively with relatively simple techniques. Self‑training with pseudo‑labels emerges as the most powerful, likely because it converts the problem into a fully supervised one after the first pass, allowing the network to benefit from the full diversity of the data. However, the method is sensitive to the initial model’s bias; systematic errors in the pseudo‑labels can be amplified. The class‑conditional and two‑phase strategies offer a compromise: they exploit all data while preserving a clean loss formulation, and they require only modest architectural changes.
From a practical standpoint, the findings are highly relevant for institutions that possess large archives of routine clinical FLAIR scans with only one of the two lesion types annotated (or none). By adopting a self‑training pipeline, they can rapidly expand a segmentation model’s applicability without incurring the prohibitive cost of double‑labelling every scan. The paper also underscores the importance of a robust preprocessing pipeline to harmonise multi‑site data, which contributed to the observed generalisation across heterogeneous cohorts.
Future Directions
Potential extensions include: (i) integrating additional MRI sequences (T1, T2, DWI) in a multimodal framework; (ii) applying Bayesian or probabilistic modelling to capture label uncertainty in pseudo‑labels; (iii) exploring contrastive or self‑supervised pre‑training to further reduce dependence on any labelled data; and (iv) evaluating the impact of different pseudo‑label confidence thresholds or ensemble pseudo‑labelling on final performance.
In summary, the paper provides a thorough empirical comparison of six easy‑to‑implement strategies for training lesion segmentation models with partially labelled data, establishes self‑training with pseudo‑labels as the top‑performing method, and offers actionable guidance for researchers and clinicians aiming to develop robust, multi‑lesion segmentation tools in real‑world, label‑sparse environments.