Deep Angular Embedding and Feature Correlation Attention for Breast MRI Cancer Analysis
Luyang Luo 1, Hao Chen 2, Xi Wang 1, Qi Dou 3, Huangjing Lin 1, Juan Zhou 4, Gongjie Li 5, and Pheng-Ann Heng 1

1 Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
2 Imsight Medical Technology, Co., Ltd., China
3 Dept. of Computing, Imperial College London, London, UK
4 Dept. of Radiology, The Fifth Medical Center of Chinese PLA General Hospital, Beijing, China
5 Beijing Image Diagnostic Center of Rimag, Beijing, China

Abstract. Accurate and automatic analysis of breast MRI plays an important role in early diagnosis and successful treatment planning for breast cancer. Owing to the heterogeneous nature of tumors, accurate diagnosis remains a challenging task. In this paper, we propose to identify breast tumors in MRI using a Cosine Margin Sigmoid Loss (CMSL) with deep learning (DL), and to localize possible cancer lesions using a COrrelation Attention Map (COAM) based on the learned features. The CMSL embeds tumor features onto a hyper-sphere and imposes a decision margin through cosine constraints. In this way, the DL model can learn more separable inter-class features and more compact intra-class features in the angular space. Furthermore, we utilize the correlations among feature vectors to generate attention maps that accurately localize cancer candidates with only image-level labels. We build the largest breast cancer dataset to date, involving 10,290 DCE-MRI scan volumes, for developing and evaluating the proposed methods. The model driven by CMSL achieved a classification accuracy of 0.855 and an AUC of 0.902 on the testing set, with sensitivity and specificity of 0.857 and 0.852, respectively, outperforming other competitive methods overall.
In addition, the proposed COAM accomplished more accurate localization of cancer centers than another state-of-the-art weakly supervised localization method.

1 Introduction

Breast cancer is the most common malignancy affecting women worldwide [1]. Early diagnosis of breast cancer is essential for successful treatment planning, and Magnetic Resonance Imaging (MRI) plays a vital role in screening high-risk populations [2]. Clinically, radiologists use the Breast Imaging-Reporting and Data System (BI-RADS) to categorize breast lesions into different levels according to the phenotypic characteristics presented in MRI images, indicating different degrees of cancer risk. However, such assessment suffers from inter-observer variance and often relies subjectively on the radiologists' experience. Moreover, owing to tumor heterogeneity, tumors with the same pathological result (malignant or benign) can exhibit diverse patterns and hence receive different BI-RADS assessments. In other words, tumors can possess ambiguous inter-class differences and large intra-class variance, which poses a serious challenge to accurate diagnosis of breast cancer.

Generally, there are two major tasks in breast MRI tumor analysis: identification of tumors and localization of cancer candidates. Recently, Deep Learning (DL) based approaches have demonstrated great potential for assisting diagnosis of breast cancer in an automatic and fast manner. Previous studies manually annotated tumors and deliberately extracted the corresponding slices or patches for classification [3,4]. Such methods depend on careful annotations for both training and testing and cannot easily be adopted in clinical applications. On the other hand, Amit et al. [5] proposed to first localize lesions automatically and then classify cancer candidates in a second stage.
Although inference at the testing stage was thereby free of lesion delineation, these works still required annotations for model training. To remove the need for manual extraction of regions of interest (RoIs), Maicas et al. [6] proposed to meta-learn the breast MRI cancer classification problem with only image-level labels. However, all of the studies mentioned above were limited to small datasets and consequently lacked validation of generalization. More importantly, the relatively low precision or specificity reported in these works implies that the aforementioned problem of ambiguous inter-class differences and large intra-class variance has not yet been addressed.

To this end, we propose a Cosine Margin Sigmoid Loss (CMSL) to tackle the heterogeneity problem in breast tumor classification, and a COrrelation Attention Map (COAM) for precise localization of cancer candidates, both using image-level labels only. The CMSL extends the cosine loss originally designed for face verification [7]. It embeds the deep feature vectors onto a hyper-sphere and learns a decision margin between classes in the angular feature space. As a result, the learned features possess more compact intra-class variance and more separable inter-class differences. In addition, we observe an RoI shifting problem when localizing cancer with the class activation map [8]. We therefore propose a novel weakly supervised method, COAM, that localizes cancer candidates more accurately by leveraging deep feature correlations via the Gram matrix. Furthermore, we build the largest breast DCE-MRI dataset to date, comprising 10,290 volume scans from 1,715 subjects, to develop and evaluate our methods.

2 Methods

Our framework for breast MRI tumor analysis consists of two parts, as illustrated in Fig. 1.
One is tumor classification by a deep angular embedding driven DL network; the other is weakly supervised localization of cancer candidates with a feature correlation attention map.

Fig. 1. The framework of breast MRI cancer analysis. A 3D ResNet is first trained with CMSL by embedding the deep features onto a hyper-sphere. In the testing stage, the deep features are used to construct a Gram matrix to obtain the correlation attention map.

2.1 Cosine Margin Sigmoid Loss for Tumor Classification

The phenotypes of tumors exhibit ambiguous inter-class differences and large intra-class variance, and the features learned by a DL model can inherit these characteristics. To address this issue, we start by revisiting the traditional sigmoid loss for the binary classification problem. Given the input feature vector x of the last fully connected (FC) layer and its corresponding label y, the binary sigmoid loss is:

L(w; x) = -y · log(p(y|x)) - (1 - y) · log(1 - p(y|x))    (1)
        = -y · log(1 / (1 + e^{-w^T x})) - (1 - y) · log(1 - 1 / (1 + e^{-w^T x}))    (2)

where w is the weight parameter of the FC layer and p(y|x) represents the probability of x being classified as y. To distinguish the classes, the DL model is expected to give different predictions by adjusting the value of w^T x. Notice that w^T x = ||w|| ||x|| cos θ, where θ is the angle between the feature vector x and the weight vector w, and ||·|| is the L2 norm. In general, the DL model implicitly alters ||w|| and ||x|| in the Euclidean space and cos θ in the angular space.
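To make the decomposition of the logit concrete, here is a minimal NumPy sketch of the plain sigmoid loss of Eqs. (1)-(2) and of the identity w^T x = ||w|| ||x|| cos θ; the function names are ours, chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_sigmoid_loss(w, x, y):
    # Eqs. (1)-(2): the logit is the raw inner product w^T x, so both
    # the norms of w and x and the angle between them affect p(y|x).
    p = sigmoid(w @ x)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def cos_theta(w, x):
    # Angular part of the logit: w^T x = ||w|| ||x|| cos(theta).
    return (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
```

Scaling either w or x changes the logit but leaves cos θ untouched, which is exactly the degree of freedom the angular formulation later removes.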
However, the aforementioned heterogeneity can produce ambiguous features that are hard to discriminate. Constraints on feature distances are therefore introduced to regularize the DL model toward more separable inter-class features and more compact intra-class features [7]. Since the Euclidean distance is unbounded and hence difficult to constrain, we prefer to regularize the angular distance, which is bounded by -1 ≤ cos θ ≤ 1. Specifically, we eliminate the influence of the norms ||x|| and ||w|| by modifying the computation of p(y|x) to:

p(y|x) = 1 / (1 + e^{-s · w^T x / (||w|| ||x||)}) = 1 / (1 + e^{-s · cos θ})    (3)

where s is a hyper-parameter that adjusts the slope of the sigmoid function and controls the magnitude of the back-propagated gradients. If s is too small, the loss cannot converge to 0 because the sigmoid function cannot reach its saturation region, given that -1 ≤ cos θ ≤ 1. Conversely, if s is too large, the sigmoid function easily reaches saturation and yields small gradients, which prevents the network from learning sufficient knowledge. Following [7], we refer to the loss with the modified p in Eq. (3) as the Normalized Sigmoid Loss (NSL), which separates features in the angular space with the decision boundary cos θ = 0 for both classes. Geometrically, we embed the feature vector and the weight vector onto a hyper-sphere whose radius is tuned by s.

Fig. 2. Illustration of NSL with s = 1, NSL with s = 20, and CMSL with s = 20 and m = 0.35. The first row gives the geometric interpretation of the feature projection on a 2D sphere; dashed arrows represent the decision boundaries. The second row plots the corresponding sigmoid functions; dashed curves represent values out of range.
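A minimal NumPy sketch of the NSL probability of Eq. (3), together with the margin variant CMSL introduced in Sect. 1, which shifts cos θ by a class-dependent margin. The helper names are ours, and a production implementation would add numerical clipping inside the logarithms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nsl_prob(w, x, s=20.0):
    # Eq. (3): p(y|x) = sigmoid(s * cos(theta)); the norms of w and x cancel,
    # so only the angular position of x relative to w matters.
    cos_t = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return sigmoid(s * cos_t)

def cmsl_loss(w, x, y, s=20.0, m=0.35):
    # CMSL: the logit becomes s * (cos(theta) - I(y) * m), where
    # I(y) = +1 for the positive (malignant) class and -1 otherwise,
    # carving a margin of width 2m around the boundary cos(theta) = 0.
    cos_t = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    ind = 1.0 if y == 1 else -1.0
    p = sigmoid(s * (cos_t - ind * m))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)
```

Because the norms cancel, rescaling x leaves nsl_prob unchanged; and for the same positive sample, a margin m > 0 always yields a loss at least as large as NSL, pushing features away from the boundary.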
However, ambiguous features can still lie near this boundary. We therefore add an explicit margin to the NSL:

L(w; x) = -y · log(1 / (1 + e^{-s · (cos θ - I(y) · m)})) - (1 - y) · log(1 - 1 / (1 + e^{-s · (cos θ - I(y) · m)}))    (4)

where I(·) is an indicator function with I(y) = 1 if y = 1 and I(y) = -1 otherwise, and m is a hyper-parameter that changes the decision boundaries for the two classes (0 for benign and 1 for malignant) to B0: cos θ + m < 0 and B1: cos θ - m > 0. A decision margin is thus imposed by m in the angular space, making the learned inter-class features more separable. Consequently, the distribution space of the features shrinks, which eventually leads to more compact intra-class features. Fig. 2 compares the different sigmoid functions and the corresponding geometric interpretations.

2.2 Feature Correlation Attention for Cancer Localization

Given the well-trained network, localization of cancer candidates can provide additional evidence for clinical reference. Our secondary goal is therefore to localize possible cancers among other lesion mimics. It is natural in DL studies to use the Class Activation Map (CAM) [8] to obtain the Region of Interest (RoI) when only image-level labels are available. However, CAM cannot be well generalized to our case because of an observed RoI shifting problem. As the CNN goes deeper, the receptive fields of neurons become larger, so neighbors of the tumor feature also capture views over the tumor patch in the image. Since the feature vectors corresponding to different classes can be ambiguous, the classifier layer possibly tends to find discriminative patterns in these neighbors. Consequently, the RoI generated by CAM shifts away from the desired target. To tackle this problem, we draw on two insights of our task.
First, feature vectors of the same semantic class (malignant or normal) ought to have high correlations with each other. Second, through a series of rectified linear units, the network implicitly learns large activation values for features related to suspicious cancer patches (label "1") and small activation values for features related to normal patches (label "0"). Based on these two intuitions, we leverage the Gram matrix [9] to find the RoI. Given the deep feature map X ∈ R^{H×W×S×C} from the last activation layer, where H, W, S, and C are the height, width, number of slices, and number of channels, respectively, we first reshape X to X' ∈ R^{N×C}, where N = H × W × S. We then compute an attention vector M ∈ R^N as follows:

M_i = Σ_{j=1}^{N} G_{i,j} = Σ_{j=1}^{N} Σ_{k=1}^{C} X'_{i,k} X'_{j,k}    (5)

where G ∈ R^{N×N} is the Gram matrix over the set of deep feature vectors in X'. Each entry G_{i,j} is the inner product of X'_i and X'_j, representing the correlation between the i-th and j-th vectors. Because our network is trained for binary classification, it induces a gap between the large and small activation values of feature vectors related to suspicious cancer and normal patches; correspondingly, the correlation values are also relatively large or small according to the activation values of the features. Inspired by [10], each column G_i can be interpreted as a sub-attention map reflecting the network's attention to the class that the i-th vector belongs to, so the operation above equals an element-wise summation over all sub-attention maps G_i. Moreover, since G is symmetric, this element-wise summation is also equivalent to summing over the rows of G, with the sum of the i-th row giving M_i. Essentially, Σ_{j=1}^{N} G_{i,j} indicates the importance of the i-th feature as determined by the sub-attention of the feature map at its i-th position.
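Eq. (5) amounts to a reshape, a Gram matrix, and a row sum. A minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def correlation_attention(feat):
    # COAM sketch for a feature map of shape (H, W, S, C), following Eq. (5):
    # reshape to X' of shape (N, C) with N = H*W*S, form the Gram matrix
    # G = X' X'^T, and sum each row of G to score every spatial position.
    H, W, S, C = feat.shape
    x = feat.reshape(-1, C)       # X' in R^{N x C}
    gram = x @ x.T                # G[i, j] = <X'_i, X'_j>
    m = gram.sum(axis=1)          # M_i = sum_j G[i, j]
    return m.reshape(H, W, S)     # attention map over the volume
```

Note that M = G · 1 = X'(X'^T 1), so the same map can be computed as `x @ x.sum(axis=0)` without ever materializing the N × N Gram matrix, reducing memory from O(N²) to O(NC); the explicit form above simply mirrors the equation.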
Finally, by simply reshaping M to H × W × S, we obtain an attention map based purely on the deep feature correlations. We refer to this method as the COrrelation Attention Map (COAM). It is worth mentioning that COAM is related to the self-attention mechanism [10] and the stationary feature-space representation [9], but differs from these works in that the Gram matrix is not involved in any optimization stage and is used directly for attention map generation.

3 Experiments and Results

3.1 Implementation Details

Dataset. We built the largest breast tumor Dynamic Contrast Enhanced (DCE) MRI dataset to date, involving 10,290 scans from 1,715 subjects, with 1,137 cases containing malignant tumors and 578 cases containing benign tumors. All scans were acquired on a 1.5-T Siemens system. We collected 6 DCE-MRI subtraction scans and 1 non-fat-suppressed T1 scan from each subject. BI-RADS categories were assessed by 3 radiologists, and pathological labels were given by biopsy or surgical diagnosis. The data were randomly divided into training, validation, and testing sets with 1,204, 165, and 346 subjects, respectively.

Preprocessing. Frangi's approach [11] was first applied to the slices of each non-fat-suppressed T1 scan to detect evident edges. Next, thresholding, small-connected-component removal, and hole filling were sequentially employed to obtain coarse breast-region masks. Afterwards, the 2D masks were stacked into volumes and Gaussian-smoothed, and the resulting 3D masks were used to segment the subtractions. Note that the DCE-MRI and non-fat-suppressed scans were originally registered by the scanning machine. Finally, we clipped and normalized the intensity values, concatenated the 6 subtractions, and cropped or padded the data to a fixed size of 340 × 220 × 128 as the model input.
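The final clipping, normalization, and crop-or-pad steps can be sketched as below. The percentile-based clipping thresholds are our illustrative assumption (the paper does not specify them); the fixed output size follows the text, and the function names are ours.

```python
import numpy as np

def clip_normalize(vol, lo_pct=0.5, hi_pct=99.5):
    # Clip intensities to percentiles (assumed choice) and rescale to [0, 1].
    lo, hi = np.percentile(vol, [lo_pct, hi_pct])
    vol = np.clip(vol, lo, hi)
    return (vol - lo) / max(hi - lo, 1e-8)

def crop_or_pad(vol, target=(340, 220, 128)):
    # Center-crop or zero-pad each axis of `vol` to the `target` shape.
    out = np.zeros(target, dtype=vol.dtype)
    src, dst = [], []
    for size, tsize in zip(vol.shape, target):
        if size >= tsize:                       # crop this axis
            start = (size - tsize) // 2
            src.append(slice(start, start + tsize))
            dst.append(slice(0, tsize))
        else:                                   # pad this axis
            start = (tsize - size) // 2
            src.append(slice(0, size))
            dst.append(slice(start, start + size))
    out[tuple(dst)] = vol[tuple(src)]
    return out
```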
Training Strategy. We used 3D ResNet34 [12] as the base model and replaced the global average pooling layer and FC layer with a 1 × 1 × 1 convolutional layer followed by a pooling layer. The hyper-parameters s and m were set to 20 and 0.35, respectively, similar to [7]. The learning rate was initially set to 1e-4 and decreased by a factor of 10 when the training error stagnated. The base model was trained until convergence and then used to initialize all other methods.

3.2 Evaluation and Comparison

Tumor Classification. We compared several deep learning methods: (1) 2D MIL: a multiple-instance method aggregating features from 2D slices with a 2D ResNet34 [13]; (2) 3D ResNet: a 3D implementation of ResNet34; (3) 3D Sparse MIL: a sparse label assignment method [14]; (4) 3D DK-MT: a domain-knowledge-driven multi-task learning network [15]; (5) 3D ResNet+NSL: the normalized sigmoid loss on top of (2); (6) 3D ResNet+CMSL: our proposed CMSL on top of (2). We computed accuracy, sensitivity, specificity, F1 score, and AUC as the evaluation metrics. Experimental results are reported in Table 1.

Table 1. Comparison of different methods on cancer classification.

Method                   Accuracy  Sensitivity  Specificity  F1     AUC
2D MIL [13]              0.789     0.870        0.626        0.846  0.842
3D ResNet [12]           0.821     0.840        0.783        0.862  0.880
3D Sparse MIL [14]       0.832     0.857        0.783        0.872  0.885
3D DK-MT [15]            0.824     0.896        0.643        0.864  0.883
3D ResNet+NSL            0.821     0.840        0.783        0.862  0.874
3D ResNet+CMSL (ours)    0.855     0.857        0.852        0.888  0.902

Fig. 3. Comparison between CAM and COAM. Typical slices from different subjects are selected for qualitative demonstration. First row: DCE-MRI subtraction slice; second row: visualization of CAM; third row: visualization of COAM. Cancer lesions are circled in red. Best viewed in color.

Compared with the 2D method, the 3D approaches achieved better results by utilizing more spatial information.
Both 3D Sparse MIL and 3D DK-MT adopt additional assumptions or knowledge, leading to better performance than the vanilla 3D ResNet. Noticeably, 3D DK-MT showed poor specificity, possibly because imbalanced auxiliary knowledge (more BI-RADS 4 and 5 cases than BI-RADS 3) dominated the learning process. For the deep angular embedding based method 3D ResNet+NSL, simply mapping the features into the angular space without a margin constraint caused a certain performance decay, implying that the network cannot learn sufficient knowledge when s is set to a large value. Moreover, our proposed 3D ResNet+CMSL method significantly improved the results; the underlying reason is that it learns more discriminative patterns by imposing the cosine margin. Our method achieved the highest specificity, over 7.9% better than all other methods, while keeping a comparable sensitivity. It exceeded all other methods by over 2% in AUC, over 3% in accuracy, and over 1.5% in F1 score, showing that addressing the inter- and intra-class problem improves the performance of breast tumor classification.

Cancer Localization. To evaluate the performance of COAM, we invited the radiologists to manually annotate 85 samples that were classified as malignant by our model. We compared our method with CAM by computing the Euclidean distance between the center position of the annotation and the voxel position with the highest value in the attention map; this distance is then multiplied by the voxel spacing, i.e., 1.1 mm, as the final measurement. The criterion is reported as mean ± std, the mean and standard deviation of the center distances over the 85 samples. Compared with the distance of 39.84 ± 8.82 mm by CAM, COAM showed a significant advantage with only 18.26 ± 13.65 mm. Fig. 3 shows the qualitative comparison between the two methods.
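The localization criterion above, the distance between the annotated center and the attention-map peak scaled by the 1.1 mm voxel spacing, can be sketched as follows; the function name is our own, and isotropic spacing is assumed as stated in the text.

```python
import numpy as np

def center_distance_mm(attention, annotated_center, voxel_spacing=1.1):
    # Euclidean distance (in mm) between the annotated lesion center and
    # the voxel with the highest attention value, assuming isotropic spacing.
    peak = np.unravel_index(np.argmax(attention), attention.shape)
    diff = np.asarray(peak, dtype=float) - np.asarray(annotated_center, dtype=float)
    return float(np.linalg.norm(diff)) * voxel_spacing
```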
4 Conclusion

In this paper, we propose the Cosine Margin Sigmoid Loss for breast tumor classification and the Correlation Attention Map for weakly supervised localization of cancer candidates in MRI scans. First, we use a CMSL-driven deep network to learn more separable inter-class features and more compact intra-class features, which effectively tackles the heterogeneity problem of tumors. In addition, the proposed COAM leverages correlations among deep features to localize regions of interest in a weakly supervised manner. Extensive experiments on our large-scale dataset demonstrate the efficacy of our methods, which significantly outperform other state-of-the-art approaches on both tasks. Our methods are general and can be extended to many other fields.

References

1. DeSantis, C. E., et al.: Breast cancer statistics, 2017, racial disparity in mortality by state. CA: A Cancer Journal for Clinicians 67(6), 439-448 (2017).
2. Kuhl, C., et al.: Prospective multicenter cohort study to refine management recommendations for women at elevated familial risk of breast cancer: the EVA trial. Journal of Clinical Oncology 28(9), 1450-1457 (2010).
3. Zheng, H., et al.: Small lesion classification in dynamic contrast enhancement MRI for breast cancer early detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham (2018).
4. Amit, G., et al.: Classification of breast MRI lesions using small-size training sets: comparison of deep learning approaches. In: Medical Imaging 2017: Computer-Aided Diagnosis, vol. 10134. International Society for Optics and Photonics (2017).
5. Amit, G., et al.: Hybrid mass detection in breast MRI combining unsupervised saliency analysis and deep learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 594-602. Springer, Cham (2017).
6. Maicas, G., et al.: Training medical image analysis systems like radiologists.
In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham (2018).
7. Wang, H., et al.: CosFace: Large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018).
8. Zhou, B., et al.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487-495 (2014).
9. Gatys, L., Ecker, A. S., Bethge, M.: Texture synthesis using convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 262-270 (2015).
10. Fu, J., et al.: Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983 (2018).
11. Frangi, A. F., et al.: Multiscale vessel enhancement filtering. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 130-137. Springer, Berlin, Heidelberg (1998).
12. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778 (2016).
13. Wu, J., et al.: Deep multiple instance learning for image classification and auto-annotation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
14. Zhu, W., et al.: Deep multi-instance networks with sparse label assignment for whole mammogram classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 603-611. Springer, Cham (2017).
15. Liu, J., et al.: Integrate domain knowledge in training CNN for ultrasonography breast cancer diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 868-875. Springer, Cham (2018).