A Hybrid Approach Between Adversarial Generative Networks and Actor-Critic Policy Gradient for Low Rate High-Resolution Image Compression

A Hybrid A ppr oach Between Adversarial Generativ e Networks and Actor -Critic P olicy Gradient f or Low Rate High-Resolution Image Compr ession Nicol ´ o Savioli Imperial College London, UK nsavioli@ic.ac.uk Abstract Image compr ession is an essential appr oach for de- cr easing the size in bytes of the image without deterio- rating the quality of it. T ypically , classic algorithms ar e used b ut r ecently deep-learning has been successfully ap- plied. In this work, is pr esented a deep super-r esolution work-ﬂow for image compr ession that maps low-r esolution JPEG image to the high-r esolution. The pipeline consists of two components: ﬁrst, an encoder-decoder neural net- work learns how to tr ansform the downsampling JPEG im- ages to high r esolution. Second, a combination between Generative Adversarial Networks (GANs) and r einforce- ment learning Actor -Critic (A3C) loss pushes the encoder- decoder to indir ectly maximize High P eak Signal-to-Noise Ratio (PSNR). Although PSNR is a fully differ entiable met- ric, this work opens the doors to new solutions for maxi- mizing non-differ ential metrics thr ough an end-to-end ap- pr oach between encoder -decoder networks and r einfor ce- ment learning policy gradient methods. 1. Introduction Image compression with deep learning systems is an ac- tiv e area of research that recently has becomes very com- pelling respect to the modern natural images codecs as JPEG2000, [1], BPG [2] W ebP currently dev eloped by Goog le R  [3]. The new deep learning methods are based on an auto-encoder architecture where the features maps, generate from a Con volutional Neural Networks (CNN) en- coder , are passed through a quantizer to create a binary rep- resentation of them, and subsequently given in input to a CNN decoder for the ﬁnal reconstruction. In this vie w , sev- eral encoders and decoders models ha ve been suggested as a ResNet [4] style network with the parametric rectiﬁed linear units (PReLU) [5], generativ e approach b uild on GANs [6] or with a inno vati ve hybrid networks made with Gated Re- current Units (GR Us) and ResNet [7]. In contrast, this paper proposes a super -resolution approach, build on a modifying version of SRGAN [8], where downsampling JPEG images are con verted at High Resolution (HR) images. Hence, in order to improv e the ﬁnal PSNR results, a Reinforcement Learning (RL) approach is used to indirectly maximize the PSNR function with an A3C policy [9] end-to-end joined with SRGAN. The main contributions of this works are: (i) Propose a compression pipeline based on JPEG image downsampling combined with a super-resolution deep net- work. (ii) Suggest a new way for maximizing not differen- tiable metrics through RL. Ho wev er , e ven if the PSNR met- ric is a fully dif ferentiable function, the proposed method could be used in future applications for non-euclidean dis- tance such as in the Dynamic Time W arping (DTW) algo- rithms [10]. 2. Methods In this section is given a more formal description of the suggested system which includes: the network architecture and the losses used for training. 2.1. Netw ork Ar chitecture The architecture consists of three main blocks: encoder , decoder and a discriminator (Figure 1). The I LR is the low- resolution (LR) input image (i.e compressed with a JPEG encoder) of size r W × r H × C ( i.e with C color chan- nels and W , H the image width and height); where a bicubic downsampling operation with factor r is applied. While the output is an HR image deﬁned as I H R . 2.1.1 Encoder The encoder is basically a ResNet [4], where the ﬁrst con vo- lution block has a kernel size of 9 × 9 and 64 Feature Maps (FM) with a P arametr icReLU activ ation function. Then, ﬁv e Residual Blocks (RB) are stacked together . Each of those RB consists of two con volution layers with kernel size 3 × 3 and 64 FM follo wed by Batch-Normalisation (BN) and P ar ametricR eLU . After that, a ﬁnal con volution block of 3 × 3 and 64 FM are repeated. Ho wev er , the encoder is also 1 Figure 1: The ﬁgure sho ws the proposed RL-SRGAN model composed by encoder, decoder and discriminator networks. The objectiv e of this model is to map the compressed JPEG Low Resolution (LR) Image to the HR. The encoder can be seen as: (i) an RL policy network able to increase its HR prediction through the indirect maximization of the PSNR at each i training iterations. (ii) A GANs, where the discriminator (i.e V GG network) push the decoder to produce images similar to the original HR ground truth. joint with a fully connected layer and, at each i training it- erations, produces an action prediction of the actual PSNR; together with a v alue function V π ( I H R ( i )) (i.e explained in 2.2.2 section). 2.1.2 Decoder The decoder is fundamentally another deep network that al- lows increasing the resolution of the output encoder with eight subpixel layers [11]. 2.1.3 Discriminator The encoder , joint with the decoder , deﬁne a generator H ( · ) θ , where θ = [ w L ; b L ] are the weight and biases pa- rameters for each L-layers for the speciﬁc network. A third network D ( · ) θ , called discriminator , is also optimized con- currently with H ( · ) θ for solving the following adversarial min-max problem: l H R GAN = min θ max θ E I H R ∼ p train ( I H R ) [ log ( D ( I H R ))]+ + E I LR ∼ p H ( I LR ) [ log (1 − D ( H ( I LR )))] (1) The idea behind l H R GAN loss is to train a generati ve model H ( · ) θ to fool D ( · ) θ . Indeed, the discriminator is trained to distinguish super-resolution images I H R , generated by H ( · ) θ , from those of the training dataset. In this way , the discriminator is increasingly struggled to distinguish the I H R images (generated by H ( · ) θ ) from the real ones and consequently dri ving the generator to produce results closer to the HR training images. Then, in the proposed model, the discriminator D ( · ) θ is parameterized through a VGG net- work with Leak yReLU activ ation ( α = 0 . 2) without max- pooling. 2.2. Loss function The accurate deﬁnition of the loss function is crucial for the performance of the H ( · ) θ generator . Here, the para- graph is logically di vided into three losses: the SRGAN loss, the RL loss, and the proposed loss. 2.2.1 SRGAN loss The SRGAN loss is determined as a combination of three other separate losses: MSE loss, VGG loss, and GANs loss. Where the MSE loss is deﬁned as: l H R M S E = 1 W H W X x =1 H X y =1 ( I H R x,y − H θ ( I LR ) x,y ) 2 (2) It represents the most utilized loss in super-resolution methods but remarkably sensiti ve to high-frequency peak with smooth textures [8] . For this reason, is used a VGG loss [8] based on the ReLU acti vation function of a 19 layer VGG (deﬁned here as Ω( · ) ) network: l H R V GG = 1 W H W X x =1 H X y =1 (Ω( I H R ) x,y − Ω( H θ ( I LR ) x,y ) 2 (3) Where W and H are the dimension of I H R image in the MSE loss. Whilst, for the VGG loss, they are the Ω( · ) output FM dimensions. While the GANs loss is previously 2 deﬁned in the equation 1. Finally , the total SRGAN loss is determined as: l H R S RGAN = l H R M S E + 1 e − 3 × l H R GAN + 6 e − 3 × l H R V GG (4) 2.2.2 RL loss The aim of RL loss is to indirectly maximize the PSNR through an actor -critic approach [9]. Giv en Q π ( I LR , P S N R pred ) a map between the low resolution in- put I LR and the current PSNR value prediction P S N R pred (see ﬁg. 1). Thence, at each i training iterations, is calcu- lated the rew ard value as a threshold between the pre vious P S N R at iteration i − 1 and that one to iteration i as fol- lows: r ( i ) = ( 1 , if P S N R i > P S N R i − 1 0 , otherwise (5) where the P S N R ( · ) function is deﬁned as: P S N R = 20 · log 10 ( M AX I ) − 10 · log 10 ( 1 mn m − 1 X l =0 n − 1 X j =0 [ I H R ( l, j ) − I H R g t ( l, j )] 2 ) (6) The M AX I is the maximum pixel value of the HR im- age, I H R is the output encoder HR image, while I H R g t is the corresponding HR ground truth for each pixel ( l, j ) at m × n HR size. The re ward (eq. 5), actually depends on the P S N R pred action taken by the policy for two main reasons: (i) during the training process the P S N R pred becomes an optimal estimator of the decoder output I H R (used in 6). (ii) The latent space between the encoder and the fully connected layer is the same and share equal policy information. Thus, all the re wards are accumulated e very k training steps through the following return function: R ( i ) = ∞ X k =0 γ k r ( i + k ) (7) where γ ∈ (0 , 1] is a discount factor . Therefore, is pos- sible to deﬁne the Q π ( · ) function as an e xpectation of R ( k ) giv en the input I LR and P S N R pred . Q π ( I LR , P S N R pred ) = E [ R ( i ) | I LR ( i ) = I LR , P S N R pred ] (8) T o notice, the encoder , together with the fully connected layer , become the policy network π ( P S N R pred | I LR ( i ); θ H ) . This policy network is parametrized by the standard RE I N F O RC E method on the θ encoder parameters with the following gradient direction: ∇ θ log π ( P S N R pred | I LR ( i ); θ ) · R ( i ) (9) It can be consider an unbiased estimation of ∇ θ · E [ R ( i )] . Especially , to reduce the variance of this e valuation (and keeping it unbiased) is desirable to subtract, from the return function, a baseline V ( I LR ( i )) called value function. The total policy agent gradient is gi ven by: l H R π = log π ( P S N R pred | I LR ( i ) , θ ) · ( R ( i ) − V ( I LR ( i ))) (10) Figure 2: The abov e ﬁgure shows the results for RL- SRGAN compared to SRGAN, LANCZOS, and the orig- inal ground truth. As we can see, LANCZOS simply de- stroys the edges producing artifacts on the global image. Whilst, SRGAN forms noticeable chromatic aberration (i.e transition from orange to yellow color) near the edges. Even though, the RL-SRGAN holds the color uniform with net outline details nearby to the edges; analogous to the origi- nal ground truth. The term R ( i ) − V ( I LR ( i )) can be considered a esti- mation of the advantage to predict P S N R pred for a giv en I LR ( i ) input. Consequently , a learnable ev aluation of the value function is used: V ( I LR ( i )) ≈ V π ( I LR ( i )) . This ap- proach, is further called generativ e actor-critic [9] becouse the P S N R pred prediction is the actor while the baseline V π ( I LR ( i )) is its critic. The RL loss is then calculated as: l H R RL = 5 e − 3 ∗ X ( R ( i ) − V π ( I LR ( i ))) 2 − l H R π (11) 2.2.3 Proposed loss The Proposed Loss (PL) combines both SRGAN loss and RL loss. After ev ery k step (i.e due to the rew ards accumu- lation process at each i training iterations), the l H R RL is added on l H R S RGAN . l H R P L = ( l H R S RGAN + l H R RL , if k = i l H R S RGAN , otherwise (12) 3 Methods PSNR MS-SSIM RL-SRGAN 22.34 0.783 SRGAN 22.15 0.780 LANCZOS 21.44 0.760 T able 1: The table shows the PSNR and MS-SSIM re- sults obtained in the v alidation set for the proposed rl-srgan method with srgan and lanczos upsampling. 2.3. Experiments and Results In this section is e v aluate the method suggested. The dataset used is the CLIC compression dataset [12] corre- spondingly divided in the train, valid and test sets. The train has 1634 HR images, valid 102 and test 330 . The ev aluation metrics used are the PSNR and MS-SSIM [13] for both valid and test. An AD AM optimizer is used with a learning rate of 1 e − 3 within 22876 model iterations until con vergence. The Reinforcement Learning SRGAN (RL- SRGAN) is compared with the SRGAN model work [8] and the Lanczos resampling (i.e a smooth interpolation through a con volution between the I LR image and a stretched sinc ( · ) function). Finally , the table 1 highlights that the PSNR difference between LANCZOS upsampling and RL- SRGAN is 0.9, while of 0.19 with SRGAN; whereas the MS-SSIM remains constant between RL-SRGAN and SR- GAN for the v alidation set. This also sho ws a better accu- racy for the RL-SRGAN model. While, for the tests, RL- SRGAN achie ve 20 . 06 of PSNR and 0 . 7503 of MS-SSIM. Furthermore, the compression rate for the v alidation set im- ages is 3.812.623 bytes respect 362.236.068 bytes of origi- nal HR dataset. While for the test set images is 5.228.411 bytes in contrast with the 5.882.850.012 bytes of the origi- nal one. That makes the method a good trade-off between compression capacity and acceptable PSNR. 2.4. Discussion A modiﬁed version of SRGAN is suggested where an A3C method is joined with GANs. Sadly , the proposed method has strong limitations due to the drastic downsam- pling of the input JPEG image. This do wnsampling causes loss of information, difﬁcult to reco ver from the super- resolution network, which leads to lower results in PSNR and MS-SSIM on the test set (i.e 20 . 06 and 0 . 7503 re- spectiv ely). Despite, the results (table 1) emphasize slight improv ement performances for RL-SRGAN related within SRGAN and a baseline LANCZOS upsampling ﬁlter . How- ev er, the proposed method compresses all test ﬁles in a par - simonious way respect to the challenge methods. Indeed, the total dimension of the compression test set is of 5236870 bytes respect to 15748677 bytes of CLIC 2019 winner . Finally , a new method for maximizing non-differentiable functions is here suggested through deep reinforcement learning technique. References [1] David S. T aubman and Michael W . Marcellin. JPEG2000 : image compr ession fundamentals, standards, and practice / David S. T aubman, Michael W . Marcellin . Kluwer Academic Publishers Boston, 2002. [2] Fabrice bellard. bpg image format. https://bellard. org/bpg . [3] W ebp image format. https://developers.google. com/speed/webp . [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR , abs/1512.03385, 2015. [5] Haojie Liu, T ong Chen, Qiu Shen, T ao Y ue, and Zhan Ma. Deep image compression via end-to-end learning. In The IEEE Conference on Computer V ision and P attern Recogni- tion (CVPR) W orkshops , June 2018. [6] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu T imofte, and Luc V an Gool. Extreme learned image compression with gans. In The IEEE Conference on Com- puter V ision and P attern Recognition (CVPR) W orkshops , June 2018. [7] George T oderici, Damien V incent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor , and Michele Covell. Full resolution image compression with recurrent neural net- works. In The IEEE Confer ence on Computer V ision and P attern Recognition (CVPR) , July 2017. [8] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P . Aitken, Alykhan T ejani, Johannes T otz, Zehan W ang, and W enzhe Shi. Photo-realistic single image super-resolution using a generativ e adversarial network. In 2017 IEEE Conference on Computer V ision and P attern Recognition, CVPR 2017, Honolulu, HI, USA, J uly 21-26, 2017 , pages 105–114, 2017. [9] V olodymyr Mnih, Adri ` a Puigdom ` enech Badia, Mehdi Mirza, Alex Graves, Timothy P . Lillicrap, Tim Harley , David Silver , and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR , abs/1602.01783, 2016. [10] Eamonn J. Keogh and Michael J. Pazzani. Scaling up dy- namic time warping for datamining applications. In Pro- ceedings of the Sixth ACM SIGKDD International Confer- ence on Knowledge Discovery and Data Mining , pages 285– 289. A CM, 2000. [11] Andrew P . Aitken, Christian Ledig, Lucas Theis, Jose Ca- ballero, Zehan W ang, and W enzhe Shi. Checkerboard ar- tifact free sub-pixel conv olution: A note on sub-pixel con- volution, resize con volution and con volution resize. CoRR , abs/1707.02937, 2017. [12] W orkshop and challenge on learned image compression (clic). http://www.compression.cc/ . [13] Zhou W ang, Eero P . Simoncelli, and Alan C. Bovik. Multi- scale structural similarity for image quality assessment. pages 1398–1402, 2003. 4

A Hybrid Approach Between Adversarial Generative Networks and Actor-Critic Policy Gradient for Low Rate High-Resolution Image Compression

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment