Understanding-informed Bias Mitigation for Fair CMR Segmentation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Artificial intelligence (AI) is increasingly used for medical imaging tasks. However, AI models can exhibit biases, particularly when trained on imbalanced datasets. One prominent example is the strong ethnicity bias reported in cardiac magnetic resonance (CMR) image segmentation models. Although this phenomenon has been described in several publications, little is known about the effectiveness of bias mitigation algorithms in this domain. We investigate the impact of common bias mitigation methods on bias between Black and White subjects in AI-based CMR segmentation models. Specifically, we use oversampling, importance reweighting and Group DRO, as well as combinations of these techniques, to mitigate the ethnicity bias. Furthermore, motivated by recent findings on the root causes of AI-based CMR segmentation bias, we evaluate the same methods using models trained and evaluated on cropped CMR images. We find that bias can be mitigated using oversampling, which significantly improves performance for the underrepresented Black subjects whilst not significantly reducing performance for the majority White subjects. Using cropped images increases performance for both ethnicities and reduces the bias, whilst adding oversampling on top of cropping reduces the bias further. When testing the models on an external clinical validation set, we find high segmentation performance and no statistically significant bias.


💡 Research Summary

This paper investigates ethnic bias in cardiac magnetic resonance (CMR) segmentation models and evaluates several bias‑mitigation strategies, both generic and informed by recent insights into the root cause of the bias. Using the UK Biobank, the authors constructed a highly imbalanced training set (15 Black vs. 4 221 White subjects) to ensure a measurable bias, and reserved the remaining subjects for internal validation. An external clinical cohort from St. Thomas’ Hospital (84 White, 30 Black) served as a domain‑shift test set. The baseline model is a 2‑D nnU‑Net trained with the standard combined cross‑entropy‑Dice loss. Three generic mitigation techniques are examined: (1) Oversampling – minority‑group samples are duplicated so that each training batch contains equal numbers of Black and White subjects; (2) Importance re‑weighting – group‑specific weights inversely proportional to group size are applied to the loss; (3) Group Distributionally Robust Optimization (Group DRO) – the loss is reformulated to minimise the worst‑performing group’s average cross‑entropy loss. The authors also test pairwise combinations of these methods.
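The three generic strategies can be illustrated compactly. The sketch below is a minimal NumPy illustration assuming per-sample losses and group labels are already available; the function names and toy values are ours, not the paper's, and the real methods operate inside an nnU-Net training loop rather than on precomputed losses:

```python
import numpy as np

def oversampling_weights(groups):
    """Per-sample draw probabilities inversely proportional to group size,
    so each batch is expected to contain equal numbers from each group."""
    groups = np.asarray(groups)
    counts = {g: int((groups == g).sum()) for g in np.unique(groups)}
    w = np.array([1.0 / counts[g] for g in groups])
    return w / w.sum()

def reweighted_loss(per_sample_loss, groups):
    """Importance reweighting: scale each sample's loss by a weight
    inversely proportional to its group's frequency."""
    groups = np.asarray(groups)
    n, uniq = len(groups), np.unique(groups)
    counts = {g: int((groups == g).sum()) for g in uniq}
    weights = np.array([n / (len(uniq) * counts[g]) for g in groups])
    return float(np.mean(weights * per_sample_loss))

def group_dro_loss(per_sample_loss, groups):
    """Group DRO (worst-group form): optimise the average loss of the
    currently worst-performing group."""
    groups = np.asarray(groups)
    return float(max(per_sample_loss[groups == g].mean()
                     for g in np.unique(groups)))
```

In practice the oversampling weights would feed a weighted batch sampler, while the two loss variants would replace the plain mean over the cross-entropy-Dice loss during training.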

Crucially, recent work identified that image features outside the heart drive the biased representations. To address this, the authors introduced a cropping preprocessing step that removes non‑cardiac tissue from the images, forcing the network to learn only cardiac‑relevant patterns. They then retrained the full set of models on cropped images, both alone and in combination with each mitigation technique, to assess whether knowledge of the bias source yields additional gains.
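The cropping step amounts to restricting each slice to a heart-centred region of interest. A minimal sketch, assuming a bounding box around the heart is already available (how that box is obtained, e.g. by a detection network, is not specified here, and `crop_to_roi` and its `margin` parameter are hypothetical names, not the paper's pipeline):

```python
import numpy as np

def crop_to_roi(image, bbox, margin=8):
    """Crop a 2-D CMR slice to a heart bounding box plus a safety margin.

    bbox = (row_min, row_max, col_min, col_max), clipped to image bounds
    so the crop never exceeds the slice.
    """
    r0, r1, c0, c1 = bbox
    r0 = max(r0 - margin, 0)
    c0 = max(c0 - margin, 0)
    r1 = min(r1 + margin, image.shape[0])
    c1 = min(c1 + margin, image.shape[1])
    return image[r0:r1, c0:c1]
```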

Results on the internal test set show that the baseline model achieves a Dice of ~0.84 for Black subjects and ~0.89 for White subjects (ΔDice ≈ 0.05). Oversampling alone raises the Black Dice to ~0.88 without harming White performance, shrinking ΔDice to ~0.01 – the most substantial single improvement. Reweighting and Group DRO each provide modest gains (Black Dice ≈ 0.85–0.86) but do not outperform oversampling. Combining oversampling with Group DRO yields a slight further increase but still trails simple oversampling.
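The ΔDice figures quoted throughout refer to the standard Dice overlap between predicted and ground-truth masks. For reference, a minimal implementation for binary masks (the convention of returning 1.0 for two empty masks is one common choice, not necessarily the paper's):

```python
import numpy as np

def dice(pred, gt):
    """Dice overlap between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # Two empty masks: define overlap as perfect.
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0
```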

When training on cropped images, both ethnic groups improve (Dice ≈ 0.90) and the bias gap halves (ΔDice ≈ 0.03). The best result is achieved with cropped + oversampling, where Black Dice reaches ~0.92 and White Dice ~0.93, reducing ΔDice to ≤ 0.01. External validation confirms that all mitigation strategies retain high segmentation quality (Dice > 0.90) and that no statistically significant ethnic disparity remains.

The authors conclude that (i) straightforward oversampling of the minority group is highly effective for this task, (ii) removing non‑cardiac image regions—i.e., addressing the identified root cause—further enhances both accuracy and fairness, and (iii) more sophisticated loss‑based methods such as Group DRO do not provide additional benefit in the current setting. Limitations include the very small number of Black training subjects, which may affect generalisability, and the need for an additional cropping pipeline in clinical practice. Future work should explore automated cardiac ROI detection, multi‑protected‑attribute mitigation (e.g., sex, age), and validation on larger, more diverse multi‑ethnic cohorts.

