SegNAS3D: Network Architecture Search with Derivative-Free Global Optimization for 3D Image Segmentation

Authors: Ken C. L. Wong, Mehdi Moradi

IBM Research – Almaden Research Center, San Jose, CA, USA
{clwong, mmoradi}@us.ibm.com

Abstract. Deep learning has largely reduced the need for manual feature selection in image segmentation. Nevertheless, network architecture optimization and hyperparameter tuning are mostly manual and time consuming. Although there are increasing research efforts on network architecture search in computer vision, most works concentrate on image classification rather than segmentation, and there are very limited efforts on medical image segmentation, especially in 3D. To remedy this, here we propose a framework, SegNAS3D, for network architecture search of 3D image segmentation. In this framework, a network architecture comprises interconnected building blocks that consist of operations such as convolution and skip connection. By representing the block structure as a learnable directed acyclic graph, hyperparameters such as the number of feature channels and the option of using deep supervision can be learned together through derivative-free global optimization. Experiments on 43 3D brain magnetic resonance images with 19 structures achieved an average Dice coefficient of 82%. Each architecture search required less than three days on three GPUs and produced architectures that were much smaller than the state-of-the-art manually created architectures.

1 Introduction

Although deep learning has largely reduced the need for manual feature selection in image segmentation [1,7], days to weeks are still required to manually search for the appropriate architecture and hyperparameters. To further reduce human workloads, network architecture search (NAS) has been proposed for image classification in the computer vision community to automatically generate network architectures.
In [13], a recurrent network was used to generate the model descriptions of neural networks, which was trained with reinforcement learning on 800 GPUs to learn architectures from scratch. In [12], a block-wise network generation pipeline was introduced to automatically build networks using the Q-learning paradigm with a tremendous increase in search efficiency.

(This paper was accepted by the International Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI 2019.)

Fig. 1. Overall architecture. Blue and white boxes represent operation outputs and copied data, respectively. A MegaBlock (MgBlk) comprises a learnable block in Section 2.1 with spatial dropout and an optional residual connection. n, p, and the options of using deep supervision and block residual connections are learnable hyperparameters. Components shown include the input image (size 128^3), Gaussian noise (std = 1), dropout (0.5), Conv(1×1×1) + BN + ReLU, Conv(1×1×1) + softmax, max pooling (2×2×2), upsampling (2×2×2), copy-and-concatenate, spatial dropout (0.1), and optional deep supervision.

Although these works are promising, efforts on NAS for medical image segmentation are very limited, especially in 3D. In [8], policy gradient reinforcement learning was used to learn the kernel size and the number of feature channels of each convolutional layer of a custom network architecture for 2D medical image segmentation.
Without learnable layer interconnections, this framework mainly performs hyperparameter tuning rather than architecture search, and the computational complexity is infeasible for 3D image segmentation. In fact, the computational requirements for 3D images are much higher than for 2D images. Furthermore, multiple local optima can be expected in the architecture search space, but they are not handled by most frameworks. Therefore, developing an efficient NAS framework for 3D images is a very challenging task.

In view of these issues, here we propose a NAS framework, SegNAS3D, for 3D image segmentation with two key contributions. I) For computational feasibility, inspired by [6], the overall network architecture is composed of repetitive block structures, with each block structure represented as a learnable directed acyclic graph. Different from [6], the interconnections among block structures are also modeled as learnable hyperparameters for a more complete network architecture search. II) By constructing the hyperparameter search space with continuous relaxation and handling untrainable situations such as the out-of-memory (OOM) error, derivative-free global optimization is applied to search for the optimal architecture. To the best of our knowledge, this is the first work of network architecture search for 3D image segmentation with global optimization. Experiments on 43 3D brain magnetic resonance (MR) images with 19 anatomical structures achieved an average Dice coefficient of 82%. Each architecture search required less than three days on three GPUs, and the resulting networks were much smaller than the V-Net [7] on the tested dataset.
2 Methodology

For computational feasibility, inspired by [6,12], the segmentation network architecture comprises two key components: the building blocks and their interconnections (Fig. 1). A building block comprises various deep-learning layers such as convolution and batch normalization, whose pattern is repeatedly used in the overall network. The residual units of the ResNet [3] are good examples. The building blocks are connected together to form the network architecture. For classification networks, the blocks are usually cascaded with pooling layers in between [3,9]. For segmentation networks, there are more variations of how different blocks are connected [11,1,7].

Fig. 2. Examples of upper triangular operation matrices and the corresponding block structures for a directed acyclic graph with four nodes. Each integer matrix element represents an operation in Table 1 and ops represents the corresponding set of operations. (a) The simplest block structure (ops: {1, 0, 0, 3, 0, 2}). (b) A more complicated block structure with multiple nodal inputs and outputs (ops: {1, 5, 0, 0, 4, 2}). (c) and (d) Illegal block structures with node 2 as a source (ops: {1, 0, 0, 0, 4, 2}) and node 1 as a sink (ops: {1, 5, 0, 0, 0, 2}), respectively.

Table 1. Block operations and their corresponding numbers. Conv(k, d) represents a k × k × k convolution with dilation rate d; d = 1 means no dilation. Each convolution is followed by batch normalization and ReLU activation.

0: None
1: Conv(1, 1)
2: Conv(3, 1)
3: Conv(5, 1)
4: Conv(3, 2)
5: Conv(5, 2)
6: Skip connection
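To make the operation numbering concrete, Table 1 can be encoded as a simple lookup table. The following Python sketch is our own illustration, not the authors' code; the dictionary descriptors are hypothetical stand-ins for actual layer objects.

```python
# Hypothetical encoding of Table 1: each integer names a block operation.
# 0 removes the edge, 6 is an identity skip connection, and the remaining
# entries are k x k x k convolutions with a dilation rate (each followed
# by batch normalization and ReLU in the paper).
OPERATIONS = {
    0: None,                                         # no edge between the nodes
    1: {"op": "conv", "kernel": 1, "dilation": 1},
    2: {"op": "conv", "kernel": 3, "dilation": 1},
    3: {"op": "conv", "kernel": 5, "dilation": 1},
    4: {"op": "conv", "kernel": 3, "dilation": 2},   # dilated for multi-scale
    5: {"op": "conv", "kernel": 5, "dilation": 2},
    6: {"op": "skip"},                               # skip connection
}

def decode_ops(ops):
    """Map a list of operation numbers (one per upper-triangular matrix
    entry) to their descriptions, e.g. ops {1, 0, 0, 3, 0, 2} of Fig. 2(a)."""
    return [OPERATIONS[o] for o in ops]
```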
2.1 Block Structure

Inspired by [6], a block is represented as a directed acyclic graph. Each node represents a feature map (tensor) and each directed edge represents an operation (e.g., convolution). Here we represent the graph as an upper triangular operation matrix which contains all operations among nodes (Fig. 2). The rows and columns of the matrix represent the input and output nodes, respectively, with nonzero elements representing operation numbers (Table 1). There are two types of nodes crucial for building trainable networks: 1) Source: a node that does not have parents in a block. 2) Sink: a node that does not have children in a block. In a block, only the first node can be the source and the last node can be the sink, as they are connected to other blocks. A network cannot be built if there are sources or sinks as the intermediate nodes. Therefore, the simplest block structure can be represented by a "shifted" diagonal matrix (Fig. 2(a)), and more complicated structures can also be achieved (Fig. 2(b)). With the matrix representation, a source and a sink can be easily identified as the column and the row with all zeros, respectively (Fig. 2(c) and (d)).

The block operations and the corresponding numbers are shown in Table 1. The operations include convolutions with different kernel sizes (k = 1, 3, 5) and dilation rates (d = 1, 2) for multi-scale features [11]. Each convolution is followed by batch normalization and ReLU activation. Skip connection, which allows better convergence, is also included. Outputs from different nodes are combined by summation, as concatenation mostly led to the OOM error in our experiments. The number of nodes (nodes) in a block is also a learnable hyperparameter.
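The source/sink rules translate directly into a legality check on the operation matrix. The sketch below is a minimal illustration with assumed function names (not the authors' code), filling the upper triangle row by row from a flat ops list and rejecting any block with an intermediate source or sink:

```python
def ops_to_matrix(ops, n):
    """Fill an n x n upper-triangular operation matrix from the flat list
    ops of length n*(n-1)//2, taken row-major above the diagonal
    (assumed ordering, consistent with the examples of Fig. 2)."""
    mat = [[0] * n for _ in range(n)]
    it = iter(ops)
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = next(it)
    return mat

def is_legal_block(mat):
    """A block is trainable only if node 0 is the sole source and the last
    node is the sole sink: every other node needs at least one parent
    (a nonzero entry in its column) and one child (in its row)."""
    n = len(mat)
    for k in range(n):
        has_parent = any(mat[i][k] for i in range(k))        # column k
        has_child = any(mat[k][j] for j in range(k + 1, n))  # row k
        if k > 0 and not has_parent:      # a source other than node 0
            return False
        if k < n - 1 and not has_child:   # a sink other than the last node
            return False
    return True
```

Applying the check to the operation sets of Fig. 2 rejects (c) (node 2 has an all-zero column, i.e. it is a source) and (d) (node 1 has an all-zero row, i.e. it is a sink), while accepting (a) and (b).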
To reduce the complexity of architecture search, all blocks in a network share the same operation matrix, with the numbers of feature channels systematically assigned based on the number of feature channels of the first block (Section 2.2).

2.2 Network Architecture and Block-Connecting Hyperparameters

Although there are multiple ways to connect the blocks together for image segmentation, we adopted an architecture similar to the U-Net [1] and V-Net [7] as they were proposed for 3D medical image segmentation (Fig. 1). The architecture contains the encoding and decoding paths with MegaBlocks. Each MegaBlock comprises a block in Section 2.1 with spatial dropout and an optional residual connection to reduce overfitting and enhance convergence. The number of channels is doubled after each max pooling and is halved after each upsampling. Deep supervision, which allows more direct backpropagation to the hidden layers for faster convergence and better accuracy, is also an option [5]. The number of feature channels of the first block (n), the number of max poolings (p), and the options of using deep supervision (sup) and block residual connections (res) are learnable block-connecting hyperparameters.

2.3 Global Optimization with Continuous Relaxation

As the number of hyperparameter combinations can be huge (> 141 million in some of our experiments) and each corresponds to a network training, brute-force search is prohibitive and nonlinear optimization is required. Compared with discrete optimization, there are many more continuous optimization algorithms available, especially for derivative-free global optimization [2]. Therefore, similar to [6], continuous relaxation is used to remove the integrality constraint of each parameter. This also allows us to introduce non-integral hyperparameters such as the learning rate if desired.
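As a sketch of the continuous relaxation, the bounds of SegNAS11 in Table 2 can be decoded from a continuous point by flooring each coordinate. The function and variable names below are our own illustration, not the paper's implementation:

```python
import math

# Search dimensions of SegNAS11 (Table 2): each is a half-open interval
# [a, b), and taking the floor recovers the integers {a, ..., b-1}.
BOUNDS = [(8, 33), (2, 6), (0, 2), (0, 2), (2, 5)] + [(0, 7)] * 6
NAMES = ["n", "p", "sup", "res", "nodes"] + ["op%d" % i for i in range(6)]

def decode(x):
    """Floor a continuous point x (within BOUNDS) to the discrete
    hyperparameter setting; the network is built from floor(x)."""
    assert len(x) == len(BOUNDS)
    setting = {}
    for name, (lo, hi), xi in zip(NAMES, BOUNDS, x):
        assert lo <= xi < hi, "point outside the bounded search space"
        setting[name] = math.floor(xi)
    return setting
```

For example, decode([16.7, 3.2, 0.9, 1.1, 3.5, ...]) yields n = 16, p = 3, sup = 0 (disabled), res = 1 (enabled), and nodes = 3, so only three of the six decoded ops are needed to fill the 3 × 3 upper triangle.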
Different from [6], which formulated the problem for local gradient-based optimization, we use derivative-free global optimization. This is because it is nonoptimal to compute gradients of the discontinuous objective function, and multiple local minima can be expected. Handling of untrainable situations is also simpler without gradients.

Table 2. Variations of the proposed framework with different learnable hyperparameters and their lower and upper bounds. The effective set of integers of each half-open interval [a, b) is {a, ..., b − 1}. For bounds [0, 2), {0, 1} means {Disable, Enable}. The upper bound of nodes determines the number of block-operation hyperparameters (ops) required; for example, nodes with bounds [2, 5) requires six ops to fill a 4 × 4 upper triangular matrix. Fixed scalars are marked with *. For SegNAS4, ops of {2, 0, 2} represents two cascaded Conv(3, 1) in Table 1.

           | Block-connecting hyperparameters   | Block structures
           | n        p       sup     res       | nodes    ops
SegNAS11   | [8, 33)  [2, 6)  [0, 2)  [0, 2)    | [2, 5)   [0, 7) (6×)
SegNAS4    | [8, 33)  [2, 6)  [0, 2)  [0, 2)    | 3*       {2, 0, 2}*
SegNAS7    | 16*      4*      0*      1*        | [2, 5)   [0, 7) (6×)

Let x ∈ R^{n_h} be a vector of n_h hyperparameters after continuous relaxation. We use ⌊x⌋ (floor of x) to construct a network architecture. Therefore, the objective function is a discontinuous function in a bounded continuous search space, which can be better handled by derivative-free global optimization. The objective function f = −ln(Dice) is used, where Dice is the validation Dice coefficient. The derivative-free global optimization algorithm "controlled random search" (CRS) [4] is used as it provides effective search with good performance among the tested algorithms. CRS starts with a population of sample points (10 × (n_h + 1)) which are gradually evolved by an algorithm that resembles a randomized Nelder-Mead algorithm. Each search stops after 300 iterations.
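The objective evaluation can be sketched as a memoized wrapper around a training run. In the sketch below, `train_and_validate` is a hypothetical stand-in (our assumption, not the authors' code) that builds and trains a network from ⌊x⌋ and returns the validation Dice coefficient, or None for an illegal block structure or an OOM error:

```python
import math

def make_objective(train_and_validate):
    """Build the search objective f = -ln(Dice) over continuous points x.

    `train_and_validate` is a placeholder: given the integer setting
    floor(x), it returns the validation Dice coefficient, or None when
    the block structure is illegal or training hits an OOM error.
    """
    cache = {}  # floor(x) -> f, so the same architecture is never retrained

    def f(x):
        key = tuple(math.floor(xi) for xi in x)
        if key not in cache:
            dice = train_and_validate(key)
            if dice is None:
                # Illegal structure or OOM: assign ceil(max f) = 10, worse
                # than the worst segmentation (Dice clipped at 1e-4 gives
                # f = -ln(1e-4) ~ 9.2).
                cache[key] = 10.0
            else:
                cache[key] = -math.log(max(dice, 1e-4))
        return cache[key]

    return f
```

Because many continuous points share the same ⌊x⌋, the cache avoids redundant trainings; the penalty value makes untrainable settings strictly worse than any trainable one, so CRS steers away from them.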
Several issues need to be handled for effective and efficient search. Firstly, to handle hyperparameters of illegal block structures (Section 2.1) and OOM errors, we assign them an objective function value ⌈max f⌉, which is 10 by clipping the minimum value of Dice at 10^-4. This tells the optimization algorithm that these situations are worse than having the worst segmentation. Secondly, as multiple x contribute to the same ⌊x⌋, we save each ⌊x⌋ and the corresponding f to avoid unnecessary training for better efficiency.

2.4 Training Strategy

In each network training, image augmentation with rotation (axial, ±30°), shifting (±20%), and scaling ([0.8, 1.2]) is used, and each image has an 80% chance to be transformed. The optimizer Nadam is used for fast convergence with the learning rate as 10^-3. The exponential logarithmic loss with Dice loss and cross-entropy is used [10]. The IBM Power System AC922 equipped with NVLink for enhanced host-to-GPU communication was used. This machine features NVIDIA Tesla V100 GPUs with 16 GB memory, and three of these GPUs were used for multi-GPU training with a batch size of three and 100 epochs.

Fig. 3. Evolutions of the validation Dice coefficients during search (examples of a dataset split). Iterations with illegal block structures and OOM errors are omitted. The effective numbers of validation Dice coefficients for SegNAS11, SegNAS4, and SegNAS7 were 139, 272, and 189, respectively, out of 300 iterations.

3 Experiments

3.1 Data and Experimental Setups

We validated our framework on 3D brain MR image segmentation. A dataset of 43 T1-weighted MP-RAGE images from different patients was neuroanatomically labeled to provide the training, validation, and testing samples.
They were manually segmented by highly trained experts, and each had 19 semantic labels of brain structures. Each image was resampled to isotropic spacing using the minimum spacing, zero padded, and resized to 128 × 128 × 128. Three sets of dataset splits were generated by shuffling and splitting the dataset, with 50% for training, 20% for validation, and 30% for testing in each set. The training and validation data were used during architecture search to provide the training data and the validation Dice coefficients for the objective function. The testing data were only used to test the optimal networks after search.

Three variations of our proposed framework were tested (Table 2). SegNAS11 optimizes both block structures and their interconnections. SegNAS4 optimizes only the block-connecting hyperparameters with fixed block structures. SegNAS7 optimizes only the block structures with fixed block-connecting hyperparameters inferred from the V-Net. Note that the subscripts indicate the numbers of hyperparameters to be optimized. We performed experiments on the 3D U-Net [1] and V-Net [7] for comparison. The same training strategy and dataset splits were used in all experiments.

3.2 Results and Discussion

Examples of the evolutions of the validation Dice coefficients during search are shown in Fig. 3. In all tests, there were more fluctuations at the early iterations as the optimization algorithm searched for the global optimum, and the evolutions gradually converged. SegNAS11 had the least effective number of Dice coefficients (139) as its larger number of hyperparameter combinations led to more illegal structures and OOM errors. In contrast, SegNAS4 had the most effective number (272). We can also see that searching optimal block structures (SegNAS11 and SegNAS7) led to larger fluctuations, and searching only block-connecting hyperparameters (SegNAS4) gave faster convergence.

Table 3. Average results of all dataset splits and the optimal hyperparameters of a dataset split (same split as Fig. 3). Fixed hyperparameters are marked with *. The testing Dice coefficients are shown. GPU days are the number of searching days multiplied by the number of GPUs (three) used. Because nodes is three for SegNAS11, only three of its six ops were used to form the network. Please refer to Sections 2.1 and 2.2 for the definitions of hyperparameters.

           Average results (mean ± std)                 Optimal hyperparameters of a search
           Dice (%)     Params (millions)  GPU days     n    p   sup  res  nodes  ops
SegNAS11   81.7 ± 0.3   9.7 ± 4.1          6.6 ± 0.6    26   3   0    1    3      {2, 2, 3, 6, 3, 3}
SegNAS4    81.0 ± 0.5   3.2 ± 0.6          3.6 ± 0.1    21   3   1    0    3*     {2, 0, 2}*
SegNAS7    77.7 ± 1.0   30.1 ± 5.4         8.2 ± 0.4    16*  4*  0*   1*   4      {6, 2, 3, 0, 4, 3}
3D U-Net   OOM          19.1 ± 0.0         —
V-Net      47.9 ± 7.4   71.1 ± 0.0         —

Fig. 4. Visualization of an example. Top: axial view. Bottom: 3D view with the cerebral grey, cerebral white, and cerebellar grey matters hidden for better illustration. Ground truth; SegNAS11, Dice = 83%; SegNAS4, Dice = 82%; SegNAS7, Dice = 78%; V-Net, Dice = 51%.

Table 3 shows the average results from all three dataset splits and the optimal hyperparameters of a dataset split. The V-Net gave the lowest testing Dice coefficients and the largest model. SegNAS11 had the best segmentation performance, while SegNAS4 produced the smallest models with the fewest GPU days for comparably good performance. Among the variations, SegNAS7 had the lowest Dice coefficients, largest models, and most GPU days. The 3D U-Net gave the OOM error and produced a larger network than SegNAS11 and SegNAS4. As three GPUs were used, each search required less than three days to complete. Fig. 4 shows the results of an example, which are consistent with Table 3.
Therefore, the block-connecting hyperparameters n, p, sup, and res are more effective, especially with simple block structures such as that of SegNAS4. Searching also the block structures can improve segmentation accuracy with increased searching time and probably larger models. Searching only the block structures can lead to larger models depending on the fixed n, p values and is not as effective. The 3D U-Net gave the OOM error because of its relatively large memory footprint (e.g., tensors of 128 × 128 × 128 with 64 feature channels). The segmentations of the V-Net were inaccurate, probably because of insufficient training data given the number of network parameters. When we increased the amount of training data from 50% to 70%, the testing Dice coefficients of the V-Net increased to 68.1 ± 2.3%. These show the advantages of our framework, as the OOM error is explicitly considered and the relation between the network size and the available data is automatically handled.

4 Conclusion

We present a network architecture search framework for 3D image segmentation. By representing the network architecture with learnable connecting block structures and identifying the hyperparameters to be optimized, we formulate the search as a global optimization problem with continuous relaxation. With its flexibility, we studied three variations of the framework. The results show that the block-connecting hyperparameters are more effective, and optimizing also the block structures can further improve the segmentation performance.

References

1. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. LNCS, vol. 9901, pp. 424–432 (2016)
2. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. SIAM (2009)
3. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. LNCS, vol. 9908, pp. 630–645 (2016)
4. Kaelo, P., Ali, M.M.: Some variants of the controlled random search algorithm for global optimization. Journal of Optimization Theory and Applications 130(2), 253–264 (2006)
5. Lee, C.Y., Xie, S., Gallagher, P.W., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: International Conference on Artificial Intelligence and Statistics. pp. 562–570 (2015)
6. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. arXiv:1806.09055 [cs.LG] (2018)
7. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: IEEE International Conference on 3D Vision. pp. 565–571 (2016)
8. Mortazi, A., Bagci, U.: Automatically designing CNN architectures for medical image segmentation. In: International Workshop on Machine Learning in Medical Imaging. LNCS, vol. 11046, pp. 98–106 (2018)
9. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI Conference on Artificial Intelligence. pp. 4278–4284 (2017)
10. Wong, K.C.L., Moradi, M., Tang, H., Syeda-Mahmood, T.: 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. LNCS, vol. 11072, pp. 612–619 (2018)
11. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122 [cs.CV] (2015)
12. Zhong, Z., Yan, J., Wu, W., Shao, J., Liu, C.L.: Practical block-wise neural network architecture generation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2423–2432 (2018)
13. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv:1611.01578 [cs.LG] (2016)
