Image Super Resolution via Bilinear Pooling: Application to Confocal Endomicroscopy

Saeed Izadi, Darren Sutton, and Ghassan Hamarneh
School of Computing Science, Simon Fraser University, Canada
{saeedi, darrens, hamarneh}@sfu.ca

Abstract. Recent developments in image acquisition literature have miniaturized the confocal laser endomicroscopes to improve usability and flexibility of the apparatus in actual clinical settings. However, miniaturized devices collect less light and have fewer optical components, resulting in pixelation artifacts and low resolution images. Owing to the strength of deep networks, many supervised methods known as super resolution have achieved considerable success in restoring low resolution images by generating the missing high frequency details. In this work, we propose a novel attention mechanism that, for the first time, combines 1st- and 2nd-order statistics for pooling operation, in the spatial and channel-wise dimensions. We compare the efficacy of our method to 10 other existing single image super resolution techniques that compensate for the reduction in image quality caused by the necessity of endomicroscope miniaturization. All evaluations are carried out on three publicly available datasets. Experimental results show that our method can produce superior results against state-of-the-art in terms of PSNR and SSIM metrics. Additionally, our proposed method is lightweight and suitable for real-time inference.

1 Introduction

Colorectal cancer is known as the fourth most-common cancer and remains one of the leading causes of cancer related mortality in the world. In 2018, more than 1 million people were affected by colorectal cancer worldwide, resulting in an estimated 550,000 deaths [2]. Rapid histopathologic assessment is an important tool that may improve disease prognosis by detecting early-stage cancer and pre-cancerous conditions.
Although biopsy and ex-vivo tissue examination are widely accepted as the diagnostic gold standard, such procedures take time and may limit the ability of the endoscopist to rapidly gauge disease severity. Confocal laser endomicroscopy (CLE), on the other hand, has substantially improved real-time in-vivo visualization of the subsurface of living cells, vascular structures, and tissue patterns during endoscopic examination [10].

For in-vivo histological examination, the large size of the microscope complicates navigation of the interior of the body in a clinical setting. Therefore, it is necessary to reduce the size of the microscope to completely and safely access the organ(s) of interest. However, miniaturization reduces the number of optical elements in the microscope probe, introducing pixelation artifacts in the acquired images. One strategy to remove image artifacts and enhance image quality is to directly post-process degraded images. An emerging process in the field of image processing, referred to as single image super-resolution (SR), aims to reconstruct an accurate high-resolution (HR) image given its low-resolution (LR) counterpart. Thus, SR is a promising software method to mitigate image degradation due to hardware miniaturization.

Among traditional SR algorithms, Huang et al. [8] proposed leveraging self-similarity modulo affine transformations to accommodate natural deformation of recurring statistical priors within and across scales of an image. Timofte et al. [18,19] used a combination of neighbour embedding and sparse dictionary learning over an external database and proposed anchored neighborhood regression in the dictionary atom space. Recently, CNNs have advanced the SR research field by directly learning the mapping between LR and HR images [4,11,12,13,1]. Dong et al. [4] demonstrated that a fully convolutional network trained end-to-end can perform LR-to-HR nonlinear mapping. Kim et al.
[11] suggested a trained network to predict additive details in the form of a residual image, which is summed with the interpolated image. Kim et al. [12] addressed model overfitting by reducing the number of parameters via recursive convolutional layers. Lai et al. [13] designed a network which progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. Ahn et al. [1] improved speed and efficiency of SR models by designing a cascade mechanism over residual networks. Lastly, Cheng et al. [3] used recursive squeeze and excitation modules in a network to exploit relationships between channels.

Izadi et al. [9] reported the first attempt to deploy CNNs on CLE images. They used a densely connected CNN to transform synthetic LR images into HR ones. Ravì et al. [15] employed a CNN to restore missing details in LR images. They collected a set of consecutive LR frames and generated synthetic HR images using a video registration technique. In a more recent study [16], Ravì et al. trained a CNN for unsupervised SR on CLE images using a cycle consistency regularization, designed to impose acquisition properties on the super-resolved images.

In this paper, we present a lightweight convolutional neural network (CNN) that is appropriate for frame-wise SR by incorporating a novel attention mechanism. In contrast to SESR [3], which leverages attention modules from the Squeeze-and-Excitation network (SENet) [7] to re-weight channels, we introduce a novel weighting scheme to recalibrate learned features based on pairwise relationships. Our attention modules comprise both 1st-order pooling and 2nd-order pooling (a.k.a. bilinear pooling), improving the quality of learned features in the network by considering pairwise correlations along feature channels and spatial regions [5]. The compactness and computational speed of our network lend themselves well to real-time implementation during in-vivo examination. We demonstrate that stacking attention modules between a low-level feature extraction head and a feature integration tail produces quantitatively and qualitatively superior results over existing SR methods and generalizes well over unseen microscopic datasets.

Fig. 1: (a) The overall architecture of our proposed network. (b) RBAM architecture. (c) Channel-wise and (d) spatial attention architectures.

2 Method

Network Overview. Fig. 1-a depicts the overall architecture of our proposed LR-to-SR network. Let $I_{LR} \in \mathbb{R}^{1 \times H \times W}$, $I_{SR} \in \mathbb{R}^{1 \times rH \times rW}$, and $r$ denote the low-resolution input, the super-resolved output, and the downsampling factor, respectively. We use a convolution layer, denoted by $\mathcal{F}_c(\cdot)$, with a $3 \times 3$ kernel and $C$ output channels to extract initial features $H_0 \in \mathbb{R}^{C \times H \times W}$, i.e.

$$H_0 = \mathcal{F}_c(I_{LR}; \theta_0), \tag{1}$$

where $\theta$ refers to the learnable parameters. In our proposed network, the initial features $H_0$ are updated by sequential residual attention modules, denoted as $\mathcal{G}(\cdot)$, and a skip connection. The entire high-level feature extraction stage is denoted as $\mathcal{B}(\cdot)$:

$$H_B = \mathcal{B}(H_0) = \mathcal{G}_B(\mathcal{G}_{B-1}(\dots(\mathcal{G}_1(H_0))\dots)) + H_0. \tag{2}$$

To upsample the feature maps, we use sub-pixel convolutions, denoted as $\mathcal{U}(\cdot)$, followed by a single-channel $1 \times 1$ convolution for SR reconstruction:

$$I_{SR} = \mathcal{F}_1(\mathcal{U}(H_B; \theta_{up}); \theta_{rec}). \tag{3}$$

Residual Bilinear Attention Module. In our proposed RBAM, we combine 1st- and 2nd-order pooling operations spatially and channel-wise to recalibrate learned features for efficient network training. Fig. 1-b illustrates the structure of our proposed RBAM. Mathematically, we formulate RBAM as:

$$H_b = \mathcal{G}_b(H_{b-1}) = \mathcal{Q}_b(H_{b-1}) + H_{b-1}, \tag{4}$$

where $\mathcal{Q}(\cdot)$ denotes the attention modules before the skip connection.
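The data flow of Eqs. (2)-(4) can be illustrated with a minimal sketch that uses plain Python lists in place of tensors. It shows the local and global residual connections and the pixel-shuffle rearrangement that a sub-pixel convolution ends with. The names `rbam`, `feature_stage`, and `pixel_shuffle` are illustrative and do not come from the paper; the attention branch $\mathcal{Q}_b$ is passed in as an arbitrary callable, and the channel ordering used in `pixel_shuffle` is an assumed common convention that the paper does not specify.

```python
def rbam(h_prev, q):
    """Eq. (4): H_b = Q_b(H_{b-1}) + H_{b-1}, on a flat list of features."""
    return [a + b for a, b in zip(q(h_prev), h_prev)]


def feature_stage(h0, qs):
    """Eq. (2): chain the blocks G_1..G_B, then add the global skip H_0."""
    h = h0
    for q in qs:
        h = rbam(h, q)
    return [a + b for a, b in zip(h, h0)]


def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, r*H, r*W).

    Assumed channel layout: source channel = c*r*r + (y % r)*r + (x % r).
    """
    channels_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for y in range(height * r):
            for x in range(width * r):
                # The sub-pixel offset (y % r, x % r) selects the source channel.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = features[src][y // r][x // r]
    return out


if __name__ == "__main__":
    halve = lambda h: [0.5 * v for v in h]  # hypothetical stand-in for Q_b
    print(feature_stage([1.0, 2.0], [halve, halve]))       # [3.25, 6.5]
    print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2))  # [[[1, 2], [3, 4]]]
```

With two stand-in blocks that each scale their input by 0.5, every block maps its input to 1.5 times itself, and the global skip adds the initial features once more, which gives the factor 3.25 seen in the first printed line.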
Given the input feature maps $H_{b-1} \in \mathbb{R}^{C \times H \times W}$, two convolutions with $3 \times 3$ kernel size interleaved with a ReLU activation function are performed to produce high-level feature maps $H^{b}_{\mathrm{conv}} \in \mathbb{R}^{C \times H \times W}$ as input to the attention branches:

$$H^{b}_{\mathrm{conv}} = \mathcal{F}_c(\mathcal{F}_c(H_{b-1}; \theta^{b}_{1}); \theta^{b}_{2}). \tag{5}$$

Channel-wise Attention (CA) Branch. CA leverages the inter-channel correspondence between feature responses (Fig. 1-c). 1st- and 2nd-order pooling mechanisms operate on $H^{b}_{\mathrm{conv}}$, producing two vectors $F^{\mathrm{1st}}_{\mathrm{ca}}, F^{\mathrm{2nd}}_{\mathrm{ca}} \in \mathbb{R}^{C \times 1 \times 1}$. $F^{\mathrm{1st}}_{\mathrm{ca}}$ is the 1st-order CA obtained by spatial average pooling to squeeze the feature map of each channel [7]. To obtain the 2nd-order CA, pairwise channel correlations are computed in the form of a covariance matrix $\Sigma \in \mathbb{R}^{C \times C}$ by spatial flattening, dimension permutation, and matrix multiplication. Each row in $\Sigma$ encodes the statistical dependency of a channel with respect to every other channel [5]. Given the covariance matrix $\Sigma$, we adopt a row-wise convolution with $1 \times C$ kernel size to produce the 2nd-order CA vector $F^{\mathrm{2nd}}_{\mathrm{ca}}$. Finally, two successive 1-D convolutions interleaved with a ReLU activation function operate on the vector $F^{\mathrm{1st}}_{\mathrm{ca}} + F^{\mathrm{2nd}}_{\mathrm{ca}}$. The output of the convolution operation is fed into a sigmoid function $\sigma$, followed by element-wise multiplication $\otimes$ to produce the $b$-th updated feature maps $H^{b}_{\mathrm{ca}}$:

$$H^{b}_{\mathrm{ca}} = H^{b}_{\mathrm{conv}} \otimes \sigma(\mathcal{F}_c(\mathcal{F}_{c/4}(F^{\mathrm{1st}}_{\mathrm{ca}} + F^{\mathrm{2nd}}_{\mathrm{ca}}; \theta^{b}_{3}); \theta^{b}_{4})). \tag{6}$$

Spatial Attention (SA) Branch. SA indicates shared correspondence between spatial regions across all feature maps (Fig. 1-d). Given $H^{b}_{\mathrm{conv}}$ as the input, the 1st-order spatial attention matrix, $F^{\mathrm{1st}}_{\mathrm{sa}} \in \mathbb{R}^{1 \times H \times W}$, is computed by the average pooling operation along the channel dimension to aggregate information for each spatial location across all features.
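The pooling statistics introduced so far can be sketched in a few lines of plain Python, assuming the feature maps have already been flattened spatially into `C` lists of `H*W` values. The function names are illustrative, the covariance is normalised by the number of spatial positions (an assumption; the paper does not state the normalisation), and `rowwise_conv` applies a single kernel shared across rows as a stand-in for the learned row-wise convolution, whose exact weight sharing is not specified in the text. The learned 1-D convolutions and the sigmoid gate of Eq. (6) are left out.

```python
def channel_mean(feats):
    """1st-order channel statistic: spatial average of each channel."""
    return [sum(ch) / len(ch) for ch in feats]


def spatial_mean(feats):
    """1st-order spatial statistic: average over channels at each location."""
    n_channels = len(feats)
    return [sum(column) / n_channels for column in zip(*feats)]


def channel_covariance(feats):
    """2nd-order statistic: C x C covariance of spatially flattened channels.

    Assumption: divide by the number of positions N (population covariance).
    """
    means = channel_mean(feats)
    centered = [[v - m for v in ch] for ch, m in zip(feats, means)]
    n = len(feats[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in centered]
            for ci in centered]


def rowwise_conv(cov, kernel):
    """Reduce each covariance row to one value with a 1 x C kernel.

    Assumption: one kernel shared by all rows; the weights would be learned.
    """
    return [sum(w * v for w, v in zip(kernel, row)) for row in cov]


if __name__ == "__main__":
    feats = [[1.0, 3.0], [2.0, 6.0]]        # C = 2 channels, N = 2 positions
    print(channel_mean(feats))              # [2.0, 4.0]
    print(spatial_mean(feats))              # [1.5, 4.5]
    cov = channel_covariance(feats)
    print(cov)                              # [[1.0, 2.0], [2.0, 4.0]]
    print(rowwise_conv(cov, [1.0, 0.5]))    # [2.0, 4.0]
```

In this toy input the second channel is exactly twice the first, so every covariance row is a multiple of the other, which is the kind of inter-channel dependency that average pooling alone cannot expose.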
To compute the 2nd-order spatial attention matrix, $F^{\mathrm{2nd}}_{\mathrm{sa}} \in \mathbb{R}^{1 \times H \times W}$, we first reduce the spatial size of the feature maps to $H' \times W'$ ($8 \times 8$ in our implementation) by applying average pooling. Then, appropriate reshaping, dimension permutation, and matrix multiplication are used to obtain the covariance matrix $\Sigma \in \mathbb{R}^{H'W' \times H'W'}$. Similar to channel-wise attention, a row-wise convolution with $1 \times H'W'$ kernel size is applied on $\Sigma$. Finally, dimension permutation and nearest neighbor interpolation produce $F^{\mathrm{2nd}}_{\mathrm{sa}}$. We add these two matrices together element-wise and apply a convolution with $1 \times 1$ kernel size that feeds a sigmoid function. Spatial attention is realised by element-wise multiplication over all feature maps, formulated as:

$$H^{b}_{\mathrm{sa}} = H^{b}_{\mathrm{conv}} \otimes \sigma(\mathcal{F}_c(F^{\mathrm{1st}}_{\mathrm{sa}} + F^{\mathrm{2nd}}_{\mathrm{sa}}; \theta^{b}_{5})). \tag{7}$$

Fig. 2: Qualitative results and their PSNR scores at ×4 SR. Each row shows the side-by-side comparison of HR with a) bicubic, b) GR, c) SESR, and d) RBAM across three datasets. HR images are shown for each pair for ease of visual comparison.

Attention Fusion. The updated features are concatenated ($\mathbin{+\!\!+}$) and aggregated via a convolution with a $1 \times 1$ kernel. Lastly, $H_{b-1}$ is added via skip connection:

$$H_b = \mathcal{F}_c(H^{b}_{\mathrm{ca}} \mathbin{+\!\!+} H^{b}_{\mathrm{sa}}; \theta^{b}_{6}) + H_{b-1}. \tag{8}$$

3 Results and Discussion

Data. We evaluate existing state-of-the-art SR methods, as well as our proposed RBAM, on three publicly available CLE datasets (Table 1). We select images rich in texture by assessing the SR performance of bicubic interpolation on the unseen test set. As depicted in Fig. 3, images with PSNR scores below the mean PSNR score of the bicubic method evaluated on the test set are deemed 'texture rich', and are used for evaluation, whereas images associated with scores above the mean are deemed 'texture poor'.
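The selection rule just described can be sketched as follows, assuming each image is a flat list of pixel intensities and that bicubic results have already been scored. The names `psnr` and `split_by_texture` are illustrative, the peak value of 255 assumes 8-bit images, and assigning a score exactly equal to the mean to the 'texture poor' group is an assumption, since the text only covers scores below and above the mean.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def split_by_texture(bicubic_psnr):
    """Partition image ids by the mean PSNR of their bicubic reconstruction.

    bicubic_psnr maps an image id to the PSNR of its bicubic result.
    Scores below the mean are 'texture rich'; the rest are 'texture poor'
    (ties go to 'texture poor', which is an assumption).
    """
    mean_score = sum(bicubic_psnr.values()) / len(bicubic_psnr)
    rich = [name for name, score in bicubic_psnr.items() if score < mean_score]
    poor = [name for name, score in bicubic_psnr.items() if score >= mean_score]
    return rich, poor


if __name__ == "__main__":
    scores = {"img_a": 30.0, "img_b": 34.0, "img_c": 38.0}  # made-up values
    rich, poor = split_by_texture(scores)
    print(rich)  # ['img_a']
    print(poor)  # ['img_b', 'img_c']
```

Only the images returned in `rich` would be kept for evaluation under this rule.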
In other words, images which can be effectively restored using bicubic interpolation are rejected for evaluation, as they contain little information on which to assess the performance of state-of-the-art methods. Evaluation assesses the methods' ability to reconstruct a 1024 × 1024 HR image from a synthesized LR counterpart obtained via bicubic downsampling with the appropriate factor (×2 or ×4).

Table 1: Details of the datasets used in our evaluation.

| dataset | provided by | #patients | #images | anatomical site | image size |
|---|---|---|---|---|---|
| CLE100 | Leong et al. [14] | 30 | 181 | small intestine | 1024 × 1024 |
| CLE200 | Grisan et al. [6] | 32 | 262 | esophagus | 1024 × 1024 |
| CLE1000 | Ştefănescu et al. [17] | 11 | 1025 | colorectal mucosa | 1024 × 1024 |

Fig. 3: (a) Examples of images from the partitioned test set. Images belonging to the 'texture rich' partition are used for evaluation.

Training Settings. We train all methods on a random partition (80%) of CLE100, and evaluate them on the remaining 20% as well as on CLE200 and CLE1000. For DL-based methods, we replicated the reported training settings, and used public code for traditional algorithms. For our model, we use B = 5 RBAMs and set the number of features to C = 64 to create a lightweight network. In each training batch, 16 LR patches of size 48 × 48 are randomly extracted as inputs, and augmented by random 90° rotations and horizontal/vertical flips. We use the Adam optimizer and L1 loss to train our network for 300 epochs. The initial learning rate is set to $10^{-4}$ and is halved every 50 epochs.

Ablation Investigation. We discern the effectiveness of the individual components in our network modules by ablating attention blocks and evaluating performance after 50 epochs. Our investigation shows that, for CLE100 at ×2 SR, attention-based variants outperform the baseline, demonstrating the merits of incorporating spatial and channel-wise contextual information.
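For reference, the learning-rate schedule and patch augmentation from the training settings above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' training code: epochs are counted from 0, each flip is applied with probability 0.5, and `learning_rate` and `augment` are hypothetical helper names.

```python
import random


def learning_rate(epoch, base_lr=1e-4, step=50, factor=0.5):
    """Step schedule: start at base_lr and halve it every `step` epochs.

    Assumption: epochs are 0-indexed, so epochs 0-49 use base_lr.
    """
    return base_lr * factor ** (epoch // step)


def augment(patch, rng):
    """Random 90-degree rotation plus random flips of a square patch.

    `patch` is a list of rows; `rng` is a random.Random instance.
    Assumption: each flip is applied independently with probability 0.5.
    """
    for _ in range(rng.randrange(4)):
        # Rotate 90 degrees clockwise: reverse the rows, then transpose.
        patch = [list(row) for row in zip(*patch[::-1])]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1]                   # vertical flip
    return patch


if __name__ == "__main__":
    for epoch in (0, 49, 50, 100, 299):
        print(epoch, learning_rate(epoch))
    print(augment([[1, 2], [3, 4]], random.Random(0)))
```

Over the 300 epochs described in the text this schedule halves the rate five times, so the final 50 epochs run at 1/32 of the initial learning rate.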
We also observed that using both 1st- and 2nd-order pooling operations simultaneously outperforms using either 1st- or 2nd-order channel-wise pooling individually. We similarly note that using both spatial and channel-wise attention outperforms either one alone.

Comparison to State-of-the-art. We compare the performance of traditional algorithms including ANR [18], GR [18] and A+ [19], as well as DL-based techniques including SRCNN [4], VDSR [11], DRCN [12], LapSRN [13], SESR [3] and our proposed RBAM. Table 2 summarizes the quantitative comparisons in terms of peak signal-to-noise ratio (PSNR, reported with its SEM), structural similarity (SSIM), and inference time at ×2 and ×4 SR. From the table, one can see that most DL-based methods consistently outperform traditional SR algorithms in PSNR and SSIM metrics. In particular, RBAM exceeds the second-best method, SESR, in mean PSNR over all datasets by 0.18 dB and 0.13 dB for ×2 and ×4 SR, respectively. Furthermore, RBAM is a practical compromise between inference time and generalization.

Table 2: Quantitative results of SR models at ×2 and ×4 factors. The Type column marks traditional (T) and DL-based (DL) methods. PSNR scores are reported with the standard error of the mean (SEM) for each method.

| Scale | Method | Type | CLE100 PSNR | CLE100 SSIM | CLE200 PSNR | CLE200 SSIM | CLE1000 PSNR | CLE1000 SSIM | time |
|---|---|---|---|---|---|---|---|---|---|
| ×2 | Bicubic | - | 33.69 ± 0.06 | 0.8693 | 35.53 ± 0.01 | 0.9029 | 34.45 ± 0.01 | 0.8920 | 0.02 |
| ×2 | A+ [19] | T | 34.22 ± 0.07 | 0.8928 | 36.14 ± 0.01 | 0.9218 | 35.04 ± 0.01 | 0.9114 | 6.72 |
| ×2 | ANR [18] | T | 36.44 ± 0.13 | 0.9226 | 39.10 ± 0.01 | 0.9559 | 37.64 ± 0.01 | 0.9559 | 6.07 |
| ×2 | GR [18] | T | 36.56 ± 0.13 | 0.9243 | 39.26 ± 0.01 | 0.9579 | 37.79 ± 0.01 | 0.9448 | 4.47 |
| ×2 | SRCNN [4] | DL | 35.75 ± 0.11 | 0.9181 | 38.25 ± 0.01 | 0.9494 | 36.87 ± 0.01 | 0.9380 | 0.06 |
| ×2 | VDSR [11] | DL | 36.72 ± 0.13 | 0.9276 | 39.31 ± 0.01 | 0.9578 | 37.89 ± 0.01 | 0.9462 | 0.25 |
| ×2 | DRCN [12] | DL | 36.65 ± 0.13 | 0.9257 | 39.29 ± 0.01 | 0.9575 | 37.83 ± 0.01 | 0.9452 | 0.48 |
| ×2 | LapSRN [13] | DL | 36.71 ± 0.13 | 0.9264 | 39.25 ± 0.01 | 0.9583 | 37.91 ± 0.01 | 0.9462 | 0.07 |
| ×2 | SESR [3] | DL | 36.76 ± 0.13 | 0.9282 | 39.36 ± 0.01 | 0.9583 | 37.91 ± 0.01 | 0.9462 | 0.27 |
| ×2 | RBAM (Ours) | DL | 36.91 ± 0.12 | 0.9321 | 39.45 ± 0.01 | 0.9590 | 38.22 ± 0.01 | 0.9501 | 0.18 |
| ×4 | Bicubic | - | 31.29 ± 0.04 | 0.6673 | 32.45 ± 0.01 | 0.7318 | 31.78 ± 0.01 | 0.7278 | 0.02 |
| ×4 | A+ [19] | T | 31.57 ± 0.04 | 0.7042 | 32.76 ± 0.01 | 0.7607 | 32.06 ± 0.01 | 0.7517 | 3.03 |
| ×4 | ANR [18] | T | 31.68 ± 0.04 | 0.7160 | 32.93 ± 0.01 | 0.7736 | 32.23 ± 0.01 | 0.7671 | 2.88 |
| ×4 | GR [18] | T | 31.70 ± 0.04 | 0.7201 | 32.95 ± 0.01 | 0.7736 | 32.25 ± 0.01 | 0.7703 | 2.31 |
| ×4 | SRCNN [4] | DL | 31.59 ± 0.04 | 0.7073 | 32.76 ± 0.01 | 0.7617 | 32.07 ± 0.01 | 0.7566 | 0.06 |
| ×4 | VDSR [11] | DL | 31.66 ± 0.04 | 0.7144 | 32.86 ± 0.01 | 0.7804 | 32.16 ± 0.01 | 0.7635 | 0.25 |
| ×4 | DRCN [12] | DL | 31.70 ± 0.04 | 0.7214 | 32.92 ± 0.01 | 0.7750 | 32.21 ± 0.01 | 0.7635 | 0.48 |
| ×4 | LapSRN [13] | DL | 31.68 ± 0.04 | 0.7190 | 32.76 ± 0.01 | 0.7617 | 32.29 ± 0.01 | 0.7737 | 0.08 |
| ×4 | SESR [3] | DL | 31.76 ± 0.04 | 0.7249 | 32.99 ± 0.01 | 0.7804 | 32.29 ± 0.01 | 0.7737 | 0.33 |
| ×4 | RBAM (Ours) | DL | 31.84 ± 0.04 | 0.7315 | 33.11 ± 0.01 | 0.7852 | 32.47 ± 0.01 | 0.7874 | 0.07 |

Our results show a moderate quantitative increase in PSNR score and a considerable increase in qualitative performance; this is similar to previous works in single image super resolution [20]. Fig. 2 shows selected image patches from each dataset for qualitative assessment. RBAM can delicately restore high-frequency cues, such as granular textures and sudden changes in grayscale pixel intensity.
This manifests qualitatively in the form of improved restoration of high frequency details such as cell membranes (CLE200, CLE1000 examples) and intracellular spaces (CLE100 example).

Motivation for Bilinear Pooling. We combine 1st-order and 2nd-order pooling to recalibrate learned features based on channels that activate often or correspond to feature rich inputs, respectively. Channels that activate often are likely responding to common, low frequency image features. Conversely, channels that are highly correlated may be responding to feature rich instances in the image space that activate multiple filters simultaneously. High frequency features tend to be complex, and not as common semantically compared to low frequency image details. Therefore, channels that learn complex image features may not be emphasized by first order pooling operations alone. Combining first and second order pooling in an attention module assures that hard working channels are rewarded without diminishing the optimization of channels that learn complex features in the low to high resolution image mapping space.

4 Conclusion

We proposed the first network that simultaneously leverages both first and second order statistics for pooling in spatial and channel-wise attention mechanisms, resulting in a lightweight and fast model that restores high frequency image details. We compared our proposed model with various traditional and DL-based SR techniques on three CLE datasets in terms of image quality assessment metrics and inference time. Our RBAM network outperforms existing lightweight methods across different datasets, downsampling factors, and SR performance evaluation criteria. Experimental results also highlight the potential applicability of inexpensive software-based post-processing SR modules that improve degraded images in miniaturized CLE devices in real-time.

Acknowledgments.
Thanks to the NVIDIA Corporation for the donation of Titan X GPUs used in this research and to the Collaborative Health Research Projects (CHRP) for funding.

References

1. N. Ahn et al. Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, 2018.
2. F. Bray et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 68(6):394–424, 2018.
3. X. Cheng et al. SESR: Single image super resolution with recursive squeeze and excitation networks. In IEEE ICPR, pages 147–152, 2018.
4. C. Dong et al. Image super-resolution using deep convolutional networks. IEEE TPAMI, 38(2):295–307, 2016.
5. Gao et al. Global second-order pooling neural networks. 2018.
6. E. Grisan et al. 239 Computer aided diagnosis of Barrett's esophagus using confocal laser endomicroscopy: Preliminary data. Gast. Endosc., 75(4):AB126, 2012.
7. J. Hu et al. Squeeze-and-excitation networks. In IEEE CVPR, 2018.
8. J. Huang et al. Single image super-resolution from transformed self-exemplars. In IEEE CVPR, pages 5197–5206, 2015.
9. S. Izadi et al. Can deep learning relax endomicroscopy hardware miniaturization requirements? In MICCAI 2018, pages 57–64, 2018.
10. R. Kiesslich et al. Confocal laser endoscopy for diagnosing intraepithelial neoplasias and colorectal cancer in vivo. Gastroenterology, 127(3):706–713, 2004.
11. J. Kim et al. Accurate image super-resolution using very deep convolutional networks. In IEEE CVPR, pages 1646–1654, 2016.
12. J. Kim et al. Deeply-recursive convolutional network for image super-resolution. In IEEE CVPR, pages 1637–1645, 2016.
13. W. Lai et al. Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE TPAMI, pages 1–1, 2018.
14. R. W. Leong et al. In vivo confocal endomicroscopy in the diagnosis and evaluation of celiac disease. Gastroenterology, 135(6):1870–1876, 2008.
15. D. Ravì et al. Effective deep learning training for single-image super-resolution in endomicroscopy exploiting video-registration-based reconstruction. International Journal of Computer Assisted Radiology and Surgery, 13:917–924, 2018.
16. D. Ravì et al. Adversarial training with cycle consistency for unsupervised super-resolution in endomicroscopy. Medical Image Analysis, 53:123–131, 2019.
17. D. Ştefănescu et al. Computer aided diagnosis for confocal laser endomicroscopy in advanced colorectal adenocarcinoma. PLoS ONE, 11(5):e0154863, 2016.
18. R. Timofte et al. Anchored neighborhood regression for fast example-based super-resolution. In IEEE ICCV, pages 1920–1927, 2013.
19. R. Timofte et al. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pages 111–126, 2015.
20. W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao. Deep learning for single image super-resolution: A brief review. IEEE Trans. on Multimedia, 2019.
