When Unseen Domain Generalization is Unnecessary? Rethinking Data Augmentation

Ling Zhang¹, Xiaosong Wang¹, Dong Yang¹, Thomas Sanford³, Stephanie Harmon⁴, Baris Turkbey³, Holger Roth¹, Andriy Myronenko², Daguang Xu¹, and Ziyue Xu¹

¹ NVIDIA, Bethesda MD 20814, USA
² NVIDIA, Santa Clara CA 95051, USA
³ NIH, Bethesda MD 20892, USA
⁴ NCI, Bethesda MD 20892, USA

Abstract. Recent advances in deep learning for medical image segmentation demonstrate expert-level accuracy. However, in clinically realistic environments, such methods have marginal performance due to differences in image domains, including different imaging protocols, device vendors, and patient populations. Here we consider the problem of domain generalization: a model is trained once, and its performance is expected to generalize to unseen domains. Intuitively, within a specific medical imaging modality the domain differences are smaller relative to the domain variability of natural images. We rethink data augmentation for medical 3D images and propose a deep stacked transformations (DST) approach for domain generalization. Specifically, a series of n stacked transformations is applied to each image in each mini-batch during network training to account for the contribution of domain-specific shifts in medical images. We comprehensively evaluate our method on three tasks: segmentation of the whole prostate from 3D MRI, the left atrium from 3D MRI, and the left ventricle from 3D ultrasound. We demonstrate that when trained on a small source dataset, (i) on average, DST models degrade only by 11% (Dice score change) on unseen datasets, compared to conventional augmentation (degrading 39%) and a CycleGAN-based domain adaptation method (degrading 25%); (ii) when evaluated on the same domain, DST is also better, albeit only marginally; (iii) when trained on large-sized data, DST on unseen domains reaches the performance of state-of-the-art fully supervised models.
These findings establish a strong benchmark for the study of domain generalization in medical imaging, and can be generalized to the design of robust deep segmentation models for clinical deployment.

1 Introduction

Practical application of AI medical imaging methods requires accurate and robust performance on unseen domains, with differences such as acquisition protocols across centers, scanner vendors, and patient populations (see Fig. 1). Unfortunately, labeled medical datasets are typically small and do not include sufficient variability for robust deep learning training. The lack of large, diverse medical imaging datasets often leads to marginal deep learning model performance on new "unseen" domains, which limits their application in clinical practice [9].

Fig. 1. Medical image segmentation in source and unseen domains (i.e., a specific medical imaging modality across different vendors, imaging protocols, patient populations, etc.) for (a) whole prostate MRI, (b) left atrial MRI, and (c) left ventricle ultrasound. The illustrated images are processed with intensity normalization.

To improve model performance on unseen domains, transfer learning methods attempt to fine-tune a portion of a pre-trained network given a small amount of annotated data from the unseen target domain. Transfer learning applications for medical 3D images often lack high-quality pre-trained models (trained on a large amount of data). Domain adaptation methods do not require annotations in the unseen domain, but usually require all source and target domain images to be available during training [11,1,10]. The assumption of a known target dataset is restrictive and makes multi-site deployment impractical.
Furthermore, due to medical data privacy requirements, it is difficult to collect both the source and target datasets beforehand. In the field of medical imaging, we are usually faced with the difficult situation where the training dataset is derived from a single center and acquired with a specific protocol.

In such situations, domain generalization methods seek a robust model, trained once, that is capable of generalizing well to unseen domains. In 2D computer vision applications, researchers have focused on data augmentation of various complexity to expand the available data distribution. Specifically, data augmentation strategies are performed in input space [6] or during adversarial learning [7]. Compared to natural 2D images, 3D medical image domain variability is more compact. Within the same modality, e.g., T2 MRI or ultrasound, images from different vendors (GE, Philips, Siemens), scanning protocols, and patient populations are visually different mainly in three aspects: image quality, image appearance, and spatial shape (see Fig. 1). Other imaging modalities, such as CT, generally have more consistent image characteristics.

Motivated by the observed heterogeneity of 3D medical images, we propose a systematic augmentation approach consisting of a series of transformations to simulate the domain shift properties of medical imaging data. We call this approach deep stacked transformations (DST) augmentation. DST operates in image space, where input images undergo nine stacked transformations. Each transform is controlled by two parameters, which determine the probability and magnitude of the image transformation. As the backbone semantic segmentation network we use AH-Net [5].

In 3D medical imaging applications, the selection of image augmentations is often intuitive (random crop or flip), inherited from 2D computer vision applications.
Furthermore, the contribution of an augmentation method is rarely evaluated on the unseen domain. In this work, we comprehensively evaluate the effect of various data augmentation techniques on the generalization of 3D segmentation to unseen domains. The evaluation tasks include segmentation of the whole prostate from 3D MRI, the left ventricle from 3D ultrasound, and the left atrium from 3D MRI. For each task we have up to 4 different datasets, so that we can train on one and evaluate generalization to the others. The results and analysis:

• Reveal the main factors causing domain shift in 3D medical imaging modalities.
• Demonstrate that DST augmentation substantially outperforms conventional augmentation and CycleGAN-based domain adaptation on unseen domains for both MRI and ultrasound. The generalization improvements are observed even on the same domain (albeit much less noticeably).
• Show that, given a larger training dataset, DST achieves state-of-the-art segmentation accuracy on unseen domains.

2 Methods

To improve the generalization of a 3D medical semantic segmentation method, we use a series of n stacked augmentation transforms τ(·) applied to input images during training. Each transformation is an image processing function with two hyper-parameters: probability p and magnitude m:

(x̂_s, ŷ_s) = τ^n_{p_n,m_n}(τ^{n-1}_{p_{n-1},m_{n-1}}(··· τ^1_{p_1,m_1}(x_s, y_s)))    (1)

where x_s, y_s are an input image and its corresponding label. Augmentation transforms alter image quality, appearance, and spatial structure. Specifically, DST consists of the following transforms: sharpening, blurring, noise, brightness adjustment, contrast change, intensity perturbation, rotation, scaling, and deformation, in addition to random cropping. In DST, the transforms are applied in the order described; model performance is not sensitive to different orderings.
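The composition in Eq. (1) can be sketched as a loop over (transform, probability, magnitude) triples. The two example transforms below (noise and a brightness shift) are illustrative stand-ins for the nine DST transforms; all function names are ours, not the authors' implementation.

```python
import numpy as np

def add_noise(x, m, rng):
    # Additive Gaussian noise with a randomly sampled std up to m.
    return x + rng.normal(0.0, rng.uniform(0.0, m), size=x.shape)

def shift_brightness(x, m, rng):
    # Random additive intensity shift in [-m, m].
    return x + rng.uniform(-m, m)

def apply_dst(image, transforms, rng):
    """Apply (fn, p, m) triples in a fixed order; each transform
    fires independently with probability p, as in Eq. (1)."""
    for fn, p, m in transforms:
        if rng.random() < p:
            image = fn(image, m, rng)
    return image

rng = np.random.default_rng(42)
volume = rng.random((32, 96, 96)).astype(np.float32)  # toy 3D sub-volume
pipeline = [(add_noise, 0.5, 0.5), (shift_brightness, 0.5, 0.1)]
augmented = apply_dst(volume, pipeline, rng)
assert augmented.shape == volume.shape
```

For segmentation, the spatial transforms in the stack would be applied to the label volume as well, with the same sampled parameters.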
As we show in our experiments, augmenting image sets during training can result in models with more robust segmentations than if data processing/synthesis were performed at the inference stage. Fig. 2 shows examples of DST augmentation in 3D MRI and ultrasound, demonstrating its ability to mimic image appearances in unseen domains within a given modality.

Image Quality is related to the sharpness, blurriness, and noise level of medical images. Blurriness is commonly caused by MR/ultrasound motion artifacts and resolution. Gaussian filtering is used to blur the image, with a magnitude (Gaussian std) ranging between [0.25, 1.5]. Sharpening has the reverse effect, using unsharp masking with strength [10, 30]. Noise is added (from a normal distribution with std [0.1, 1.0]) to account for possible noise in images.

Image Appearance is associated with the statistical characteristics of image intensities, such as variations of brightness and contrast, which often result from different scanning protocols and device vendors. Brightness augmentation refers to a random shift [-0.1, 0.1] in intensity space. Contrast augmentation refers to gamma correction with gamma (magnitude) ranging between [0.5, 4.5]. Finally, we use a random linear transform in intensity space with magnitudes of scale and shift sampled from [-0.1, 0.1], which we refer to as intensity perturbation.

Fig. 2. Examples of deep stacked transformations (DST) results on (a) whole prostate MRI, (b) left atrial MRI, and (c) left ventricle ultrasound. 1st row: ROIs randomly cropped from source domains; 2nd row: corresponding ROIs after DST; 3rd row: ROIs randomly cropped from unseen domains. The image pairs of the 2nd–3rd rows have better visual similarity than those of the 1st–3rd rows.

Spatial Transforms include rotation, scaling, and deformation.
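The image appearance transforms above can be sketched in a few lines of numpy, using the stated ranges (gamma in [0.5, 4.5], brightness shift and linear perturbation in [-0.1, 0.1]). Function names are illustrative, and images are assumed normalized to [0, 1].

```python
import numpy as np

def contrast_gamma(image, rng, low=0.5, high=4.5):
    # Contrast change via gamma correction on a [0, 1] image.
    return np.clip(image, 0.0, 1.0) ** rng.uniform(low, high)

def brightness_shift(image, rng, magnitude=0.1):
    # Random additive shift in intensity space.
    return image + rng.uniform(-magnitude, magnitude)

def intensity_perturbation(image, rng, magnitude=0.1):
    # Random linear transform: scale around 1 plus a small shift.
    scale = 1.0 + rng.uniform(-magnitude, magnitude)
    shift = rng.uniform(-magnitude, magnitude)
    return scale * image + shift

rng = np.random.default_rng(7)
roi = rng.random((16, 64, 64))  # toy 3D region of interest
out = intensity_perturbation(brightness_shift(contrast_gamma(roi, rng), rng), rng)
assert out.shape == roi.shape
```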
Rotation is usually caused by different patient orientations during scanning (we use a [-20°, 20°] range). Scaling and deformation are due to organ shape variability and soft tissue motion. Random scaling is used with magnitude [0.4, 1.6]. The deformation transform uses regular grid interpolation after a random perturbation (Gaussian-smoothed, std [10, 13]). The same spatial transform is applied to both the input images and the corresponding labels.

These operations are computationally expensive for large 3D volumetric data. A GPU-based acceleration approach could be developed, but allocating the maximal capacity of GPU memory to model training alone, with data augmentation performed on the fly, is more desirable. In addition, since the whole 3D volume does not fit into GPU memory, sub-volume cropping is usually needed to feed the network during training. We develop an efficient CPU-based spatial transform technique based on an open-source implementation (https://github.com/MIC-DKFZ/batchgenerators), which first calculates the 3D coordinate grid of the sub-volume (with size w × h × d voxels), applies the transformations (combining random 3D rotation, scaling, deformation, and cropping) to that grid, and then performs image interpolation. We achieve a further acceleration by performing interpolation only within the minimal cuboid containing the 3D coordinate grid; as such, the computational time is independent of the input volume size (i.e., it depends only on the cropped sub-volume size), and the spatial transform augmentation can be performed on the fly during training.

3 Experiments

3.1 Datasets

We validate our method on three segmentation tasks: segmentation of the whole prostate from 3D MRI, the left atrium from 3D MRI, and the left ventricle from 3D ultrasound.

Task 1: For whole prostate segmentation from 3D MRI, we use the following datasets: the Prostate dataset from the Medical Segmentation Decathlon¹ (MSD-P), PROMISE12 [4], NCI-ISBI13², and ProstateX [3]. We train on the MSD-P dataset (source domain) and evaluate on the other datasets (unseen domains). We use only single-channel (T2) input and segment the whole prostate, which is the lowest common denominator among the datasets. One study in ProstateX was excluded due to a prior surgical procedure.

Task 2: For left atrium segmentation from 3D MRI, we use the following datasets: the Heart dataset from MSD¹ (MSD-H), ASC [8], and MM-WHS [13]. We train on the MSD-H dataset (source domain) and evaluate on the other datasets.

Task 3: For left ventricle segmentation from 3D ultrasound, we use data from CETUS³ (30 volumes). We manually split the dataset into 3 subsets corresponding to different ultrasound device vendors A, B, and C, with 10 volumes each. We used heuristics to identify vendor association, but we acknowledge that our split strategy may include wrong associations. We train on Vendor A images and evaluate on Vendors B and C.

Table 1 summarizes the datasets. In addition, a larger proprietary 3D MRI dataset of 465 volumes is used in the final experiment (see Section 3.3).

Table 1. Datasets used in our experiments (# Data shows train/validation splits for source datasets).

| Task | Domain | Dataset | # Data |
|---|---|---|---|
| 1. MRI - whole prostate | Source | MSD-P | 26/6 |
| | Unseen | PROMISE12 | 50 |
| | Unseen | NCI-ISBI13 | 60 |
| | Unseen | ProstateX | 98 |
| 2. MRI - left atrium | Source | MSD-H | 16/4 |
| | Unseen | ASC | 100 |
| | Unseen | MM-WHS | 20 |
| 3. Ultrasound - left ventricle | Source | CETUS-A | 8/2 |
| | Unseen | CETUS-B | 10 |
| | Unseen | CETUS-C | 10 |

3.2 Implementation

We implemented our approach in TensorFlow and trained it on an NVIDIA Tesla V100 16GB GPU. We use AH-Net [5] as the backbone for 3D segmentation, which takes advantage of a 2D pretrained ResNet50 as the encoder and learns the full 3D decoder.
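The coordinate-grid sub-volume augmentation described in Section 2 can be sketched as follows. This is a simplified nearest-neighbour version with only in-plane rotation and isotropic scaling (no deformation), and all names and parameters are illustrative, not the batchgenerators implementation.

```python
import numpy as np

def random_grid_crop(volume, crop=(32, 48, 48), max_angle=20.0,
                     scale_range=(0.4, 1.6), rng=None):
    """Build the 3D coordinate grid of the output sub-volume, apply a random
    rotation and scaling to the grid, then sample the full volume.
    Cost depends only on the crop size, not on the input volume size."""
    if rng is None:
        rng = np.random.default_rng()
    d, h, w = crop
    # Random crop centre, kept away from the volume borders.
    centre = np.array([rng.integers(d // 2, volume.shape[0] - d // 2),
                       rng.integers(h // 2, volume.shape[1] - h // 2),
                       rng.integers(w // 2, volume.shape[2] - w // 2)], float)
    zz, yy, xx = np.meshgrid(np.arange(d) - d / 2.0,
                             np.arange(h) - h / 2.0,
                             np.arange(w) - w / 2.0, indexing="ij")
    coords = np.stack([zz, yy, xx]).reshape(3, -1)
    # Rotate the grid about the z axis and scale it isotropically.
    a = np.deg2rad(rng.uniform(-max_angle, max_angle))
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, np.cos(a), -np.sin(a)],
                    [0.0, np.sin(a), np.cos(a)]])
    coords = rng.uniform(*scale_range) * (rot @ coords) + centre[:, None]
    # Nearest-neighbour sampling with border clipping.
    iz, iy, ix = (np.clip(np.rint(coords[i]), 0, volume.shape[i] - 1).astype(int)
                  for i in range(3))
    return volume[iz, iy, ix].reshape(crop)

vol = np.arange(64 * 64 * 64, dtype=np.float32).reshape(64, 64, 64)
sub = random_grid_crop(vol, rng=np.random.default_rng(3))
assert sub.shape == (32, 48, 48)
```

Because the transformed coordinates are computed only for the crop, interpolation touches just the minimal region of the input, which is the property the paper exploits for on-the-fly augmentation.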
All data are re-sampled to 1×1×1 mm isotropic resolution and normalized to the [0, 1] intensity range. We use a crop size of 96×96×32 with batch size 16 for Task 1, 96×96×96 with batch size 16 for Task 2, and 96×96×96 with batch size 4 for Task 3. We use a soft Dice loss and the Adam optimizer with learning rate 10⁻⁴. We use a probability of 0.5 for each transformation in DST.

3.3 Experimental Results and Analysis

First, we evaluate the generalization performance of each augmentation transform individually. As a baseline, only random cropping with no other augmentations is used. We compare results to DST with all 9 transformations stacked, and to a popular domain adaptation method, CycleGAN [11], which maps the unseen images (on a per-slice basis) into a source-like appearance (we split each dataset 4:1 for CycleGAN training and validation, and train for 200 epochs).

¹ http://medicaldecathlon.com/index.html
² http://doi.org/10.7937/K9/TCIA.2015.zF0vlOPv
³ https://www.creatis.insa-lyon.fr/Challenge/CETUS/

Table 2 lists segmentation Dice results on the source domain (trained on this domain and validated on a held-out subset) and on unseen domains (trained on the source but tested on other, unseen datasets).

Table 2. The effect of DST and various augmentation methods on unseen domain generalization (measured as segmentation Dice scores). Source columns (MSD-P, MSD-H, CETUS-A) indicate the dataset used for training; their Dice scores are validation Dice scores (using a split) for comparison. The remaining (unseen) columns list Dice results of the model trained on the source when applied to unseen datasets. Baseline refers to random cropping with no further augmentations. Top4 stands for the combination of the four best-performing augmentations (sharpening, brightness, contrast, scaling). Supervised indicates state-of-the-art literature results, where a model is trained and tested on the same dataset. * indicates inter-observer variability.

| Method | MSD-P | PROMISE | NCI-ISBI | ProstateX | MSD-H | ASC | WHS | CETUS-A | CETUS-B | CETUS-C | Source Avg | Unseen Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 89.6 | 60.4 | 58.0 | 76.8 | 91.9 | 4.4 | 72.9 | 85.8 | 51.7 | 39.2 | 89.1 | 49.8 |
| Sharpening | 90.6 | 65.5 | 82.8 | 84.0 | 91.5 | 5.7 | 78.9 | 83.7 | 59.5 | 78.5 | 88.6 | 62.9 |
| Blurring | 86.1 | 63.9 | 67.0 | 79.9 | 90.9 | 3.3 | 76.9 | 90.5 | 73.4 | 72.4 | 89.2 | 61.1 |
| Noise | 91.1 | 59.3 | 67.4 | 81.4 | 91.4 | 8.3 | 78.0 | 87.3 | 66.8 | 62.2 | 90.0 | 59.0 |
| Brightness | 89.7 | 63.3 | 66.9 | 83.0 | 91.3 | 12.2 | 80.2 | 85.5 | 63.6 | 83.1 | 88.8 | 63.6 |
| Contrast | 91.1 | 72.7 | 60.7 | 86.1 | 91.3 | 12.7 | 78.6 | 88.4 | 58.4 | 85.5 | 90.3 | 63.6 |
| Perturb | 90.1 | 63.4 | 69.5 | 81.5 | 91.7 | 6.6 | 77.3 | 88.5 | 63.6 | 83.1 | 90.1 | 55.7 |
| Rotation | 87.4 | 59.0 | 57.9 | 75.1 | 91.2 | 5.2 | 72.1 | 78.0 | 60.4 | 62.6 | 85.5 | 54.7 |
| Scaling | 90.8 | 59.3 | 60.8 | 78.8 | 91.3 | 7.4 | 75.3 | 91.0 | 84.1 | 68.2 | 91.0 | 61.3 |
| Deform | 89.7 | 61.4 | 61.5 | 81.2 | 91.6 | 7.8 | 69.2 | 86.3 | 62.4 | 31.4 | 89.2 | 51.1 |
| Top4 | 91.0 | 73.5 | 83.0 | 86.5 | 91.6 | 45.4 | 79.4 | 90.9 | 81.9 | 80.5 | 91.2 | 74.9 |
| CycleGAN | - | 74.7 | 76.4 | 81.2 | - | 18.0 | 76.2 | - | 65.3 | 66.6 | - | 63.5 |
| DST (ours) | 91.3 | 80.2 | 85.4 | 86.5 | 91.4 | 65.5 | 80.0 | 92.1 | 84.9 | 81.3 | 91.6 | 80.0 |
| Supervised | - | 91.4 [12] | 88.0 [2] | 91.9* | - | 94.2 [8] | 88.6 | - | 92.5* | 92.5* | - | 91.4 |

The major findings are:

• DST augmentation performs substantially better than any single tested augmentation. On average across the different tasks, DST achieves 80% generalization Dice on unseen domains, compared to the baseline (49.8%) and CycleGAN (63.5%), which achieve worse generalization performance (even though CycleGAN domain adaptation was exposed to unseen-domain images).
• In 3D MRI, image quality and appearance augmentations had the most impact, with the largest improvements coming from sharpening, followed by contrast, brightness, and intensity perturbation. Spatial transforms had less impact in prostate MRI compared to heart MRI, where the shape, size, and orientation of the heart can be very different (see Fig. 1).
• In ultrasound, the main contributions came from spatial scaling, followed by brightness, blurring, and contrast augmentations (see Fig. 1(c)).
• In some datasets (such as ASC), all the individual augmentations and CycleGAN performed very poorly (< 13% Dice), whereas DST had reasonable performance. This supports our claim that comprehensive transforms are required to cover the potentially large variability of unseen data.

Table 3. The effect of DST with larger training data (465 3D MRIs) for the task of whole prostate segmentation. Results marked with * are trained and tested on the same domain, or are inter-observer variability (91.9%). No evaluation of whole prostate segmentation is available in the MSD challenge.

| Method | Source (train) | Source (val) | MSD-P | PROMISE | NCI-ISBI | ProstateX | Average |
|---|---|---|---|---|---|---|---|
| Baseline | 95.6 | 89.9 | 87.8 | 82.9 | 88.8 | 90.6 | 87.5 |
| DST (ours) | 94.1 | 91.8 | 89.1 | 88.1 | 89.4 | 91.9 | 89.6 |
| State-of-the-art | - | - | - | 91.4* [12] | 88.0* [2] | 91.9* | 90.4 |

• Individual augmentation transforms may perform slightly better in some isolated cases (e.g., brightness augmentation for WHS), but on average only DST consistently shows good generalization. Even the combination of the top 4 performing augmentations (Top4) is not sufficient for robust generalization.
• Using only a simple random crop (baseline) does not generalize well to unseen datasets (with Dice dropping by as much as 40%), which supports the importance of data augmentation in general.
• Besides the improvements on unseen domains, DST also slightly improves (by 2.5%) on the source domains (it is valuable not to degrade performance on the source domain).
• DST performance is ∼10% worse compared to fully supervised methods, as they have the advantage of training and testing on the same domain with more training data. This gap can be reduced by using a larger source dataset (as shown in Section 3.3), in which case DST performance is comparable to the supervised methods.

Examples of unseen domain segmentations produced by the baseline model, CycleGAN-based domain adaptation, and DST domain generalization are shown in Fig. 3. The baseline and DST are trained only on the individual source domains, while CycleGAN requires images from the target/unseen domain to train an additional generative model.

DST with a Larger Dataset. So far, we have evaluated DST generalization performance using small (∼30 volumes) public datasets. In this section, we experiment with a larger dataset and demonstrate generalization performance comparable to supervised state-of-the-art methods. We train a model with DST on a proprietary dataset of 465 3D MRIs (denoted MultiCenter) with whole prostate annotations, collected from various medical centers worldwide.

Table 3 shows the results on unseen datasets. Overall, using a large source dataset, DST produces competitive results, with Dice being only 0.8% lower than state-of-the-art supervised methods. Supervised models were trained on each domain individually, whereas we were able to achieve similar performance training only on the source domain. Importantly, on the unseen domain, our DST model achieves the same performance as two radiologists (relative novice versus expert): it achieves a Dice score of 91.9% on the unseen ProstateX dataset, matching the Dice score between a novice and an expert radiologist's annotations on the same dataset (also 91.9%).
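The Dice scores used throughout these comparisons, and the soft Dice loss used for training (Section 3.2), can be sketched as follows. This is a minimal numpy version of the common formulation, not necessarily the authors' exact variant.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-5):
    """Soft Dice loss averaged over the batch; pred holds probabilities and
    target holds binary masks, both shaped (batch, D, H, W)."""
    axes = tuple(range(1, pred.ndim))          # reduce over spatial dims
    intersect = np.sum(pred * target, axis=axes)
    denom = np.sum(pred, axis=axes) + np.sum(target, axis=axes)
    dice = (2.0 * intersect + eps) / (denom + eps)
    return float(np.mean(1.0 - dice))

# A perfect prediction gives ~0 loss; a fully wrong one gives ~1.
mask = np.zeros((1, 8, 8, 8))
mask[:, 2:6, 2:6, 2:6] = 1.0
assert soft_dice_loss(mask, mask) < 1e-3
assert soft_dice_loss(1.0 - mask, mask) > 0.99
```

The reported Dice percentages correspond to 100 × (1 − loss) computed on binarized predictions.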
These findings suggest the feasibility of practical application of deep learning models at clinical sites, where the trained DST model generalizes well to unseen data.

Fig. 3. Generalization to unseen domains for three different 3D medical image segmentation tasks (per-task Dice for Baseline / CycleGAN / DST: Task 1: 62.4% / 72.3% / 80.3%; Task 2: 0.0% / 35.9% / 82.5%; Task 3: 18.8% / 79.0% / 87.6%). Baseline deep models have low performance on unseen MRI and ultrasound images from different clinical centers, scanner vendors, etc. The CycleGAN-based domain adaptation method helps improve segmentation performance. DST training generates robust models that significantly improve segmentation performance on unseen domains. Segmentation masks (red) overlay on unseen or CycleGAN-synthesized images.

4 Conclusion

We propose the deep stacked transformations (DST) augmentation approach for unsupervised domain generalization in 3D medical image segmentation. We evaluate DST and different augmentation strategies on three segmentation tasks (prostate 3D MRI, left atrial 3D MRI, and left ventricle 3D ultrasound) when applied to unseen domains. The experiments establish a strong benchmark for the study of domain generalization in medical imaging. Furthermore, using a larger training dataset, we show that DST generalization performance is comparable to fully supervised state-of-the-art methods, making deep learning segmentation more feasible in practice.

References

1. Degel, M.A., Navab, N., Albarqouni, S.: Domain and geometry agnostic CNNs for left atrium segmentation in 3D ultrasound. In: MICCAI, pp. 630–637 (2018)
2. Jia, H., Song, Y., Zhang, D., Huang, H., Feng, D., Fulham, M., Xia, Y., Cai, W.: 3D global convolutional adversarial network for prostate MR volume segmentation. arXiv preprint arXiv:1807.06742 (2018)
3. Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N., Huisman, H.: Computer-aided detection of prostate cancer in MRI. TMI 33(5), 1083–1092 (2014)
4. Litjens, G., Toth, R., van de Ven, W., Hoeks, C., Kerkstra, S., van Ginneken, B., Vincent, G., Guillard, G., Birbeck, N., Zhang, J., et al.: Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical Image Analysis 18(2), 359–373 (2014)
5. Liu, S., Xu, D., Zhou, S.K., Pauly, O., Grbic, S., Mertelmeier, T., Wicklein, J., Jerebko, A., Cai, W., Comaniciu, D.: 3D anisotropic hybrid network: Transferring convolutional features from 2D images to 3D anisotropic volumes. In: MICCAI, pp. 851–858. Springer (2018)
6. Romera, E., Bergasa, L.M., Alvarez, J.M., Trivedi, M.: Train here, deploy there: Robust segmentation in unseen domains. In: 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1828–1833. IEEE (2018)
7. Volpi, R., Namkoong, H., Sener, O., Duchi, J., Murino, V., Savarese, S.: Generalizing to unseen domains via adversarial data augmentation. In: NeurIPS (2018)
8. Xiong, Z., Fedorov, V.V., Fu, X., Cheng, E., Macleod, R., Zhao, J.: Fully automatic left atrium segmentation from late gadolinium enhanced magnetic resonance imaging using a dual fully convolutional neural network. TMI 38(2), 515–524 (2019)
9. Yasaka, K., Abe, O.: Deep learning and artificial intelligence in radiology: Current applications and future directions. PLoS Medicine 15(11), e1002707 (2018)
10. Zhang, Y., Miao, S., Mansi, T., Liao, R.: Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation. In: MICCAI (2018)
11. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, pp. 2223–2232 (2017)
12. Zhu, Q., Du, B., Yan, P.: Boundary-weighted domain adaptive neural network for prostate MR image segmentation. arXiv preprint arXiv:1902.08128 (2019)
13. Zhuang, X., Shen, J.: Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31, 77–87 (2016)
