Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient Neural Image Compression
Ahmed Ghorbel, Univ. Rennes, INSA Rennes, CNRS, IETR - UMR 6164, France
Wassim Hamidouche, Technology Innovation Institute, Masdar City, P.O. Box 9639, UAE
Luce Morin, Univ. Rennes, INSA Rennes, CNRS, IETR - UMR 6164, France

Recently, the performance of neural image compression (NIC) has steadily improved thanks to recent lines of research, reaching or outperforming state-of-the-art conventional codecs. Despite significant progress, current NIC methods still rely on ConvNet-based entropy coding, which is limited in modeling long-range dependencies due to its local connectivity and an increasing number of architectural biases and priors, resulting in complex, underperforming models with high decoding latency. Motivated by the efficiency investigation of the Transformer-based transform coding framework SwinT-ChARM, we propose to enhance the latter, first, with a more straightforward yet effective Transformer-based channel-wise autoregressive prior model, resulting in an absolute image compression transformer (ICT). Through the proposed ICT, we can capture both global and local contexts from the latent representation and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre-/post-processor to accurately extract more compact latent codes while reconstructing higher-quality images. Extensive experimental results on benchmark datasets show that the proposed framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the adaptive image compression transformer (AICT) and the neural codec SwinT-ChARM. All materials, including the source code of AICT, will be made publicly accessible upon acceptance for reproducible research.

CCS Concepts: • Computing methodologies → Perception; • Human-centered computing → HCI design and evaluation methods; • Visualization → Visual analytics.

Additional Key Words and Phrases: Neural Image Compression, Adaptive Resolution, Spatio-Channel Entropy Modeling, Self-attention, Transformer.

ACM Reference Format: Ahmed Ghorbel, Wassim Hamidouche, and Luce Morin. 2023. Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient Neural Image Compression. J. ACM 37, 4, Article 111 (August 2023), 23 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

Visual information is crucial in human development, communication, and engagement, and its compression is necessary for effective storage and transmission over constrained wireless and
wireline channels. Thus, designing new, more effective lossy image compression solutions remains a rich vein for scientific research. The goal is to reduce an image file size by permanently removing redundant data and less critical information, particularly high frequencies, to obtain the most compact bit-stream representation while preserving a certain level of visual fidelity. Optimizing the rate-distortion trade-off is therefore the fundamental objective: achieving a high compression ratio with low distortion.

Fig. 1. A high-level diagram of the proposed AICT solution. ChARM refers to the Transformer-based channel-wise autoregressive prior model, and s represents the resizing parameter predicted by the neural estimator (s ∈ ℝ ∩ [0, 1]).

Conventional image and video compression standards, including JPEG [60], JPEG2000 [19], H.265/high-efficiency video coding (HEVC) [54], and H.266/versatile video coding (VVC) [9], rely on hand-crafted creativity within a block-based encoder/decoder diagram [52]. In addition, recent conventional codecs [3, 48, 57, 61] employ intra-prediction, fixed transform matrices, quantization, context-adaptive arithmetic encoders, and various in-loop filters to reduce spatial and statistical redundancies and alleviate coding artifacts. However, standardizing a conventional codec has historically taken several years. Moreover, existing image compression standards are not anticipated to be an ideal and global solution for all types of image content, due to the rapid development of new image formats and the growth of high-resolution mobile devices.

On the other hand, with recent advancements in machine learning and artificial intelligence, new neural image compression (NIC) schemes [41] have emerged as a promising alternative to traditional compression methods. NIC consists of three modular parts: transform, quantization, and entropy coding. Each of these components can be represented as follows: i) autoencoders as flexible nonlinear transforms, where the encoder (i.e., analysis transform) extracts a latent representation from an input image and the decoder (i.e., synthesis transform) reconstructs the image from the decoded latent; ii) differentiable quantization that quantizes the encoded latent; and iii) a deep prior model estimating the conditional probability distribution of the quantized latent to reduce the rate. Further, these three components are jointly optimized in end-to-end training by minimizing the distortion loss between the original image and its reconstruction, and the rate needed to transmit the bit-stream of the latent representation.
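To make this three-part decomposition concrete, the following minimal sketch (in TensorFlow, which the paper's implementation uses) wires an assumed set of trained transforms together; g_a, g_s, h_a, and h_s are placeholders for the analysis, synthesis, hyper-analysis, and hyper-synthesis networks, and hard rounding stands in for the full range coder:

```python
import tensorflow as tf

def encode_decode(x, g_a, g_s, h_a, h_s):
    # Minimal three-part NIC pipeline: nonlinear transform,
    # quantization, and the prior that parameterizes entropy coding.
    y = g_a(x)                  # analysis transform -> latent y
    z = h_a(y)                  # hyper-analysis -> side information z
    y_hat = tf.round(y)         # quantization (range coding omitted)
    z_hat = tf.round(z)
    mu, sigma = h_s(z_hat)      # entropy parameters for coding y_hat
    x_hat = g_s(y_hat)          # synthesis transform -> reconstruction
    return x_hat, (y_hat, z_hat, mu, sigma)
```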
Recently, we have seen a significant surge of deep learning-based lines of study exploring the potential of artificial neural networks (ANNs) to develop various NIC frameworks, reaching or even outperforming state-of-the-art conventional codecs. Some of these previous works leverage hyperprior-related side information [7, 24] to capture short-range spatial dependencies or an additional context model [33, 45], and others use non-local mechanisms [13, 14, 34, 50] to model long-range spatial dependencies. For example, Mentzer et al. [44] proposed a generative compression method achieving high-quality reconstructions. In contrast, Minnen et al. [46] introduced channel-conditioning and latent residual prediction, taking advantage of an entropy-constrained model that uses both forward and backward adaptations. Current research trends have focused on attention-guided compressive transforms, as Zhu et al. [68] replaced the ConvNet-based transform coding in the Minnen et al. [46] architecture with a Transformer-based nonlinear transform. Later, Zou et al. [69] combined the local-aware attention mechanism with global-related feature learning and proposed a window-based attention module. An additional series of efforts has addressed new entropy coding methods, as Zhu et al. [67] proposed a probabilistic vector quantization with cascaded estimation to estimate pairs of mean and covariance under a multi-codebook structure. Guo et al. [21] introduced the concept of separate entropy coding by dividing the latent representation into two channel groups, and proposed a causal context model that makes use of cross-channel redundancies to generate highly informative adjacent contexts. Further, Kim et al. [30] exploited joint global and local hyperprior information in a content-dependent manner using an attention mechanism. He et al. [22] adopted stacked residual blocks as nonlinear transforms and a multi-dimension entropy estimation model. More recently, El-Nouby et al. [17] replaced the vanilla vector quantizer with product quantization (PQ) [26] in a compression system derived from the vector-quantized variational autoencoder (VQ-VAE) [58], offering a large set of rate-distortion points, and then introduced a novel masked image modeling (MIM) conditional entropy model that improves entropy coding by modeling the co-dependencies of the quantized latent codes. Also, Muckley et al. [47] introduced a new adversarial discriminator based on VQ-VAE that optimizes likelihood functions in the neighborhood of local images under the mean-scale hyperprior Minnen et al. [45] architecture. Additionally, Chen et al. [10] formulated an improved method called Adaptive VQ-VAE to compactly represent the latent space of a convolutional neural network. Further, Xue et al. [64] proposed an exponential R-λ model for accurate bitrate estimation, along with a multi-layer feature modulation mechanism in the compression network to ensure monotonic bitrate variation with λ. Moreover, Lv et al. [39] proposed a low-rank adaptation approach, which updates decoder weights using low-rank decomposition matrices at inference time, and a dynamic gating network that learns the optimal number and positions of adaptations. Jiang et al.
[27] introduced the Multi-Reference Entropy Model (MEM) and its advanced version, MEM+, designed to capture diverse correlations in the latent representations by employing attention maps and an enhanced checkerboard context for capturing both local and global spatial contexts. Other interesting attempts [16, 53], known as coordinate-based or implicit neural representations, have shown a good ability to represent, generate, and manipulate various data types, particularly in NIC, by training image-specific networks that map image coordinates to RGB values and compressing the image-specific parameters. On the other hand, Wu et al. [62] proposed a learned block-based hybrid image compression method, which introduces a contextual prediction module to utilize the relationship between adjacent blocks, and a boundary-aware post-processing module to remove block artifacts.

Through these numerous pioneering works, we can estimate the importance of NIC in the research field and the industry. Thus, identifying the main open challenges in this area is crucial. The first one is to discern the most relevant information necessary for the reconstruction, knowing that information overlooked during encoding is usually lost and unrecoverable for decoding. The second challenge is to enhance the trade-off between coding efficiency and decoding latency. While the existing approaches improve the transform and entropy coding accuracy, they still need to improve the decoding latency and reduce the model complexity, leading to ineffective real-world deployment. To tackle those challenges, we propose a nonlinear transform coding and channel-wise autoregressive entropy coding built on Swin Transformer [36] blocks and paired with a neural scaling network, namely the adaptive image compression transformer (AICT). Fig. 1 portrays a high-level diagram to provide a more comprehensive overview of the proposed framework.

The contributions of this paper are summarized as follows:
• We propose the image compression transformer (ICT), a nonlinear transform coding and spatio-channel autoregressive entropy coding. These modules are based on Swin Transformer blocks for effective latent decorrelation and a more flexible receptive field to adapt to contexts requiring short/long-range information.
• We propose the AICT model that adopts a scale adaptation module as a sandwich processor to enhance compression efficiency. This module consists of a neural scaling network and a ConvNeXt-based [37] pre-/post-processor to optimize differentiable resizing layers jointly with a content-dependent resize factor estimator.
• We conduct extensive experiments on four widely-used benchmark datasets to explore possible coding gain sources and demonstrate the effectiveness of AICT. In addition, we carried out a model scaling analysis and an ablation study to substantiate our architectural decisions.

The experimental results reveal the impact of the spatio-channel entropy coding, the sandwich scale adaptation component, and the joint global structure and local texture learned by the attention units through the nonlinear transform coding. These experiments show that the proposed ICT and AICT achieve respectively -4.65% and -5.11% BD-rate (PSNR) reduction over VTM-18.0 while considerably reducing the decoding latency, outperforming conventional and neural codecs in the trade-off between coding efficiency and decoding complexity.
The rest of this paper is organized as follows. First, Section 2 briefly describes the background and related works. Then, Section 3 presents our overall framework along with a detailed description of the proposed architecture. Further, we devote Section 4 to presenting and analyzing the experimental results. Finally, Section 5 concludes the paper.

2 BACKGROUND AND RELATED WORKS

Over the past years, research has renewed interest in modeling image compression as a learning problem, giving a series of pioneering works [7, 14, 15, 24, 33, 35, 40, 45, 46] that have contributed to a universal fashion effect and have achieved great success, augmented by the efficient connection to variational learning [2, 18, 20]. In the early stage, some of these methods adopted ConvNets and activation layers coupled with generalized divisive normalization (GDN) layers to perform nonlinear transform coding over a variational autoencoder (VAE) architecture. This framework creates a compact representation of the image by encoding it into a latent representation. The compressive transform squeezes out the redundancy in the image with dimensional reduction and entropy constraints. Following that, some studies focus on developing network architectures that extract compact and efficient latent representations while providing higher-quality image reconstruction.

This section reviews relevant NIC techniques, including works related to our research, while focusing on the following aspects. First, we briefly present the autoregressive context related works. Then, we describe the end-to-end NIC methods that have recently emerged, including attention-guided and Transformer-based coding. Finally, we introduce adaptive downsampling within the context of neural coding.

2.1 Autoregressive Context

Following the success of autoregressive priors in probabilistic generative models, Minnen et al. [45] were the first to introduce autoregressive and hierarchical priors within the variational image compression framework, featuring a mean-scale hyperprior. An additional context model is added to boost the rate-distortion performance. Although the combined model demonstrated superior rate-distortion performance compared to neural codecs, it came with a notable computational cost. Later, Cheng et al. [14] proposed the first model achieving competitive coding performance with VVC, using a context model in an autoregressive manner. They improved the entropy model by using a discretized K-component Gaussian mixture model (GMM). In addition, Minnen et al. [46] estimated the latent distribution's mean and standard deviation in a channel-wise manner and incorporated an autoregressive context model to condition the already-decoded latent slices and the latent rounding residual on the hyperprior, further reducing the spatial redundancy between adjacent pixels. Finally, He et al. [23] proposed a parallelizable spatial context model based on the checkerboard-shaped convolution that allows a parallel-friendly decoding implementation, thus increasing the decoding speed.

2.2 Attention-Guided Coding

Attention mechanisms were popularized in natural language processing (NLP) [38, 59]. Attention can be described as a mapping strategy that queries a set of key-value pairs to an output. For example, Vaswani et al.
[59] proposed multi-headed attention (MHA), which is frequently used in machine translation. For low-level vision tasks [35, 43, 65], the attention mechanism enables spatially adaptive feature activation, focusing on more complex areas, like rich textures, saliency, etc. In image compression, quantized attention masks are used for adaptive bit allocation; e.g., Li et al. [35] used a trimmed convolutional network to predict the conditional probability of quantized codes, and Mentzer et al. [43] relied on a 3D convolutional neural network (CNN)-based context model to learn a conditional probability model of the latent distribution. Later, Cheng et al. [14] inserted a simplified attention module (without the non-local block) into the analysis and synthesis transforms to pay more attention to complex regions. More recently, Zou et al. [69] combined the local-aware attention mechanism with global-related feature learning within an effective window-based local attention block, which can be used as a specific component to enhance ConvNet and Transformer models. Guo et al. [21] adopted a powerful group-separated attention module to strengthen the nonlinear transform networks. Further, Tang et al. [56] integrated graph attention and an asymmetric convolutional neural network (ACNN) for end-to-end image compression, to effectively capture long-range dependencies and emphasize local key features, while ensuring efficient information flow and reasonable bit allocation.

2.3 Transformer-based Coding

Recently, Transformers have been increasingly used in neural codecs. They exempt convolution operators entirely and rely on attention mechanisms to capture the interactions between inputs, regardless of their relative position, thus allowing the network to focus more on pertinent input data elements. Qian et al. [49] replaced the autoregressive hyperprior [45] with a self-attention stack and introduced a novel Transformer-based entropy model, where the Transformer's self-attention is used to relate different positions of a single latent for computing the latent representation. Zhu et al. [68] replaced all convolutions in the standard approach [7, 46] with Swin Transformer [36] blocks, leading to a more flexible receptive field that adapts to tasks requiring both short- and long-range information, and better progressive decoding of the latent. Apart from their effective window-based local attention block, Zou et al. [69] proposed a novel symmetrical Transformer (STF) framework with absolute Transformer blocks for transform coding combined with a channel-wise autoregressive model (ChARM) prior. Inspired by the adaptive characteristics of Transformers, Koyuncu et al. [32] proposed a Transformer-based context model, which generalizes the de facto standard attention mechanism to spatio-channel attention.

2.4 Adaptive Downsampling

Learned sampling techniques were first developed for image classification to improve image-level prediction while minimizing computation costs. Spatial transformer networks (STNs) [25] introduced a layer that estimates a parametrized affine, projective, and spline transformation from an input image to recover data distortions and thereby improve image classification accuracy. Recasens et al.
[51] suggested that when downsampling an input image for classification, salient regions should be "zoomed in", learning a saliency-based network jointly. Talebi et al. [55] jointly optimized the pixel values interpolated at each fixed downsampling location for classification. Marin et al. [42] recently argued that a better downsampling scheme should sample pixels more densely near object boundaries, and introduced a strategy that adapts the sampling locations based on the output of a separate edge-detection model. Further, Jin et al. [28] introduced a deformation module and a learnable downsampling operation, which can be optimized with the given segmentation model in an end-to-end fashion. In the context of NIC, Chen et al. [11] proposed a straightforward learned downsampling module that can be jointly optimized with any NIC kernels in an end-to-end fashion. Based on the STN [25], a learned resize parameter is used in a bilinear warping layer to generate a sampling grid, where the input should be sampled to produce the resampled output. They also include an additional warping layer necessary for an inverse transformation to maintain the same resolution as the input image.

Fig. 2. Overall AICT framework. We illustrate the image compression diagram of our AICT with hyperprior, Swin Transformer based ChARM, and scale adaptation module. The resize parameter network (RPN), ConvNeXt block, and Swin Transformer block architectures are respectively detailed in (a), (c), and (d) of Fig. 3.

3 PROPOSED AICT FRAMEWORK

In this section, we first formulate the NIC problem. Next, we introduce the design methodology for the overall AICT architecture, followed by a description of each component individually.

3.1 Problem Formulation

The primary challenges addressed in this work are twofold. Firstly, we aim to identify and prioritize the most pertinent information required for accurate reconstruction. It is crucial to acknowledge that any information overlooked during the encoding phase is typically lost and irretrievable during decoding. Secondly, we endeavor to optimize the delicate balance between coding efficiency and decoding latency. While existing approaches have made strides in improving the accuracy of transform and entropy coding, there remains a pressing need to mitigate decoding latency and streamline model complexity for practical real-world deployment. To tackle these challenges, we introduce a novel approach, denoted as AICT. AICT leverages nonlinear transform coding and channel-wise autoregressive entropy coding techniques, building upon Swin Transformer blocks and incorporating a neural scaling network. In the context of describing these contributions, it is imperative to establish a clear problem formulation, which we delve into next.

The objective of NIC is to minimize the distortion between the original image and its reconstruction under a specific distortion-controlling hyperparameter. For an input image x, the analysis transform g_a, with parameters φ_g, removes the image spatial redundancies and generates the latent representation y. Then, this latent is quantized to the discrete code ŷ using the quantization operator ⌈·⌋, from which a synthesis transform g_s, with parameters θ_g, reconstructs the image denoted by x̂.
The overall process can be formulated as follows:

$$\boldsymbol{y} = g_a(\boldsymbol{x} \mid \phi_g), \qquad \hat{\boldsymbol{y}} = \lceil \boldsymbol{y} \rfloor, \qquad \hat{\boldsymbol{x}} = g_s(\hat{\boldsymbol{y}} \mid \theta_g). \tag{1}$$

A hyperprior model composed of hyper-analysis and hyper-synthesis transforms (h_a, h_s) with parameters (φ_h, θ_h) is usually used to reduce the statistical redundancy among latent variables. In particular, this hyperprior model assigns a few extra bits as side information to transmit some spatial structure information and helps to learn an accurate entropy model. The generated hyper-latent representation z is quantized to the discrete code ẑ using the quantization operator ⌈·⌋. The hyperprior generation can be summarized as follows:

$$\boldsymbol{z} = h_a(\boldsymbol{y} \mid \phi_h), \qquad \hat{\boldsymbol{z}} = \lceil \boldsymbol{z} \rfloor, \qquad p_{\hat{\boldsymbol{y}} \mid \hat{\boldsymbol{z}}}(\hat{\boldsymbol{y}} \mid \hat{\boldsymbol{z}}) \leftarrow h_s(\hat{\boldsymbol{z}} \mid \theta_h). \tag{2}$$

Further, consider a context model g_cm with parameters ψ_cm and a parameter inference network g_ep with parameters ψ_ep, which estimates, from the latent ŷ, the location and scale parameters Φ = (μ, σ) of the entropy model. The parameter prediction for the i-th representation ŷ_i is expressed as follows:

$$\Phi_i = g_{ep}\big(h_s(\hat{\boldsymbol{z}}),\ g_{cm}(\hat{\boldsymbol{y}}_{<i} \mid \psi_{cm}) \mid \psi_{ep}\big), \tag{3}$$

where Φ_i = (μ_i, σ_i) is used to jointly predict entropy parameters, and ŷ_{<i} = {ŷ_1, ..., ŷ_{i-1}} denotes the observable neighbors of each symbol vector ŷ_i at the i-th location. The entropy model itself is a mixture of Gaussians convolved with a uniform distribution:

$$p_{\hat{\boldsymbol{y}}_i \mid \hat{\boldsymbol{z}}}(\hat{\boldsymbol{y}}_i \mid \hat{\boldsymbol{z}}) = \sum_{0 < k < K} \pi_i^k \Big[ \mathcal{N}\big(\mu_i^k, (\sigma_i^k)^2\big) * \mathcal{U}\big(-\tfrac{1}{2}, \tfrac{1}{2}\big) \Big](\hat{\boldsymbol{y}}_i), \tag{4}$$

where the K groups of entropy parameters (π^k, μ^k, σ^k) are calculated by g_ep, N(μ, σ²) represents the mean and scale Gaussian distribution, and U(-1/2, 1/2) denotes the uniform noise.

Both transform and quantization introduce a distortion D = MSE(x, x̂) for mean squared error (MSE) optimization that measures the reconstruction quality, with an estimated bitrate R corresponding to the expected rate of the quantized latent and hyper-latent, as described below:

$$R = \mathbb{E}\Big[-\log_2\big(p_{\hat{\boldsymbol{y}} \mid \hat{\boldsymbol{z}}}(\hat{\boldsymbol{y}} \mid \hat{\boldsymbol{z}})\big) - \log_2\big(p_{\hat{\boldsymbol{z}}}(\hat{\boldsymbol{z}})\big)\Big]. \tag{5}$$

In the case of adaptive resolution (i.e., AICT), we consider the RPN, the downscale, and the upscale modules as (r_s, a_d, a_u) with parameters (ω_r, ω_d, ω_u), respectively. The generation process of x_d and x̂ is described as follows:

$$s = r_s(\boldsymbol{x} \mid \omega_r), \qquad \boldsymbol{x}_d = a_d(\boldsymbol{x}, s \mid \omega_d), \qquad \hat{\boldsymbol{x}} = a_u(\hat{\boldsymbol{x}}_d, s \mid \omega_u). \tag{6}$$

Representing (g_a, g_s), (h_a, h_s), (g_cm, g_ep), and (r_s, a_d, a_u) by deep neural networks (DNNs) enables jointly optimizing the end-to-end model by minimizing the rate-distortion trade-off L, given a rate-controlling hyperparameter λ. This optimization problem can be expressed as follows:

$$\arg\min \mathcal{L}(\boldsymbol{x}, \hat{\boldsymbol{x}}) = \arg\min D(\boldsymbol{x}, \hat{\boldsymbol{x}}) + \lambda R = \arg\min \|\boldsymbol{x} - \hat{\boldsymbol{x}}\|_2^2 + \lambda \underbrace{\big(\mathcal{H}(\hat{\boldsymbol{y}}) + \mathcal{H}(\hat{\boldsymbol{z}})\big)}_{R}, \tag{7}$$

where H stands for the cross-entropy. Finally, we recall that training the model with the gradient descent method requires substituting the quantization with additive uniform noise [5], preventing the gradient from vanishing at the quantization. We follow this method in this paper, where the noisy representations of the latents are used to compute the rate during the training phase.
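The following minimal sketch illustrates Eq. (7) as a training objective, assuming the entropy models return per-image bit counts for ŷ and ẑ; the additive-noise quantization proxy of [5] is shown explicitly, while names such as `bits_y`/`bits_z` are illustrative:

```python
import tensorflow as tf

def quantize_proxy(y, training):
    # Additive uniform noise replaces rounding during training so the
    # quantizer has useful gradients; hard rounding is used at test time.
    if training:
        return y + tf.random.uniform(tf.shape(y), -0.5, 0.5)
    return tf.round(y)

def rd_loss(x, x_hat, bits_y, bits_z, lam):
    # Eq. (7): distortion (MSE in RGB) plus lambda-weighted rate,
    # with the rate expressed in bits per pixel.
    num_pixels = tf.cast(tf.reduce_prod(tf.shape(x)[1:3]), tf.float32)
    mse = tf.reduce_mean(tf.square(x - x_hat))
    bpp = (bits_y + bits_z) / num_pixels
    return mse + lam * bpp
```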
3.2 Overall Architecture

The overall pipeline of the proposed solution is illustrated in Fig. 2. The framework includes three modular parts. First, the scale adaptation module is composed of a tiny resize parameter network (RPN) [11], a ConvNeXt-based pre-/post-processor, and a bicubic interpolation filter. Second, the analysis/synthesis transforms (g_a, g_s) of our design consist of a combination of patch merging/expanding layers and Swin Transformer [36] blocks. The architectures of the hyper-transforms (h_a, h_s) are similar to (g_a, g_s) with different stages and configurations. Then, a Transformer-based slice transform inside a ChARM is used to estimate the distribution parameters of the quantized latent. Finally, the resulting discrete-valued data (ŷ, ẑ) are encoded into bit-streams with an arithmetic encoder.

3.3 Scale Adaptation Module

Given a source image x ∈ ℝ^{H×W×C}, we first determine an adaptive spatial resize factor s ∈ ℝ ∩ [0, 1] estimated by the RPN module, which consists of three stages of residual blocks (ResBlocks). The estimated resize parameter s is used to create a sampling grid τ_M following the STN convention, and used to adaptively down-scale x into x_d ∈ ℝ^{H'×W'×C} through bicubic interpolation, with H' = sH and W' = sW. The latter (i.e., x_d) is then encoded and decoded with the proposed ICT. Finally, the decoded image x̂_d ∈ ℝ^{H'×W'×C} is up-scaled to the original resolution x̂ ∈ ℝ^{H×W×C} using the same, initially estimated, resize parameter s. The parameterization of each layer is detailed in the RPN and ResBlock diagrams of Fig. 3 (a) and (b), respectively.

Fig. 3. Detailed description of block architectures: (a) RPN, (b) ResBlock, (c) ConvNeXt block, and (d) Swin Transformer block. DConv2D(.) stands for depthwise 2D convolution, LayerNorm for layer normalization, Dense(.) for the densely-connected neural network layer, and GELU for the activation.

In addition, a learnable depth-wise pre-/post-processor is placed before/after the bicubic sampler to mitigate the information loss introduced by down-/up-scaling, allowing the retention of information. This neural pre-/post-processing method consists of a concatenation between the input and the output of three successive ConvNeXt [37] blocks, using depth-wise convolutions with large kernel sizes to obtain efficient receptive fields. Globally, the ConvNeXt block incorporates a series of architectural choices from a Swin Transformer while maintaining the network's simplicity as a standard ConvNet, without introducing any attention-based module. These design decisions can be summarized as follows: macro design, ResNeXt's grouped convolution, inverted bottleneck, large kernel size, and various layer-wise micro designs [37]. In Fig. 3 (c), we illustrate the ConvNeXt block, where DConv2D(.) refers to the depthwise 2D convolution, LayerNorm to layer normalization, Dense(.) to the densely-connected neural network layer, and the gaussian error linear unit (GELU) to the activation function. Finally, it is essential to note that we propose to skip the scale adaptation module for a better complexity-efficient design when the predicted scale does not change the input resolution, i.e., s ≈ 1. The overhead to store and transmit the scale parameter s can be ignored, given the large bitstream size of the image. A sketch of the whole module is given below.
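The sketch below composes the module under stated assumptions: `rpn` is a placeholder for the trained resize parameter network, `tf.image.resize` with bicubic interpolation stands in for the grid-based sampler, a single ConvNeXt block stands in for the three-block concatenation described above, and the kernel size (7) and expansion factor (4) are assumed values following [37]:

```python
import tensorflow as tf
from tensorflow.keras import layers

def convnext_block(x, dim):
    # Fig. 3 (c): depthwise conv with a large kernel, LayerNorm, then
    # an inverted bottleneck of Dense layers with GELU activation.
    shortcut = x
    x = layers.DepthwiseConv2D(7, padding="same")(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    x = layers.Dense(4 * dim, activation="gelu")(x)
    x = layers.Dense(dim)(x)
    return x + shortcut

def scale_adaptation(x, rpn, codec, dim=3):
    # Predict s in [0, 1], pre-process, bicubically down-scale,
    # code with ICT, up-scale back, and post-process.
    s = tf.squeeze(rpn(x))                       # content-dependent scale
    h = tf.cast(tf.shape(x)[1], tf.float32)
    w = tf.cast(tf.shape(x)[2], tf.float32)
    new_size = tf.cast(tf.round(tf.stack([s * h, s * w])), tf.int32)
    x_pre = convnext_block(x, dim)
    x_d = tf.image.resize(x_pre, new_size, method="bicubic")
    x_d_hat = codec(x_d)                         # ICT encode/decode
    x_up = tf.image.resize(x_d_hat, tf.shape(x)[1:3], method="bicubic")
    return convnext_block(x_up, dim)
```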
3.4 Transformer-based Analysis/Synthesis Transform

The analysis transform g_a contains four stages of patch merging layers and Swin Transformer blocks to obtain a more compact low-dimensional latent representation y. In order to consciously and subtly balance the importance of feature compression through the end-to-end learning framework, we use two additional stages of patch merging layers and Swin Transformer blocks in the hyper-analysis transform to produce the hyperprior latent representation z. During training, both latents y and z are quantized using a rounding function to produce ŷ and ẑ, respectively. During inference, both latents y and z are first quantized using the same rounding function as in training and then compressed using probability tables. The quantized latent variables ŷ and ẑ are then entropy coded with respect to an indexed entropy model for a location-scale family of random variables parameterized by the output of the ChARM, and a batched entropy model for continuous random variables, respectively, to obtain the bit-streams. Finally, the quantized latents ŷ and ẑ feed the synthesis and hyper-synthesis transforms, respectively, to generate the reconstructed image. The decoder schemes are symmetric to those of the encoder, with patch-merging layers replaced by patch-expanding layers.

The Swin Transformer block architecture, depicted in Fig. 3 (d), is a variant of the vision transformer (ViT) that has recently gained attention due to its superior performance on a range of computer vision tasks. Therefore, it is essential to highlight its unique features and advantages to motivate the choice of the Swin Transformer over other ViT variants. One key advantage of the Swin Transformer is its hierarchical design, which enables it to process images of various resolutions efficiently. Unlike other ViT variants, the Swin Transformer divides the image into smaller patches at multiple scales, allowing it to capture both local and global information. This hierarchical design has been shown to be particularly effective for large-scale vision tasks. Another advantage of the Swin Transformer is its ability to incorporate spatial information into its attention mechanism. The Swin Transformer introduces a novel shifted window attention mechanism, which aggregates information from neighboring patches in a structured way, allowing it to capture spatial relationships between image features and leading to linear complexity w.r.t. the input resolution. This attention mechanism has been shown to outperform the standard ViT attention mechanism, whose complexity is quadratic, on a range of benchmarks. Overall, the Swin Transformer's efficiency and superior performance make it a promising architecture for NIC. In addition, its ability to capture both global and local features efficiently, and its adaptability to different image resolutions, make it a strong contender among other transformer-based architectures. A sketch of the analysis transform follows below.
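To make the stage structure concrete, this illustrative sketch stacks patch-merging layers and simplified Swin-style blocks (windowed attention only, without the shifted-window scheme or relative position bias); the stage widths (128, 192, 256, 320) follow the configuration reported later in Table 1, while the depths, window size, and head count are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def patch_merging(x, out_dim):
    # Halve H and W by folding each 2x2 neighborhood into channels,
    # then linearly project to the stage width.
    x = tf.nn.space_to_depth(x, block_size=2)
    return layers.Dense(out_dim)(x)

def swin_block(x, dim, window=8, heads=4):
    # Simplified windowed self-attention followed by an MLP, each with
    # a residual connection; assumes H and W divisible by the window.
    _, h, w, c = x.shape
    shortcut = x
    x = layers.LayerNormalization()(x)
    x = tf.reshape(x, [-1, h // window, window, w // window, window, c])
    x = tf.transpose(x, [0, 1, 3, 2, 4, 5])
    x = tf.reshape(x, [-1, window * window, c])
    x = layers.MultiHeadAttention(heads, c // heads)(x, x)
    x = tf.reshape(x, [-1, h // window, w // window, window, window, c])
    x = tf.transpose(x, [0, 1, 3, 2, 4, 5])
    x = shortcut + tf.reshape(x, [-1, h, w, c])
    y = layers.Dense(4 * dim, activation="gelu")(layers.LayerNormalization()(x))
    return x + layers.Dense(dim)(y)

def analysis_transform(x, widths=(128, 192, 256, 320), depths=(2, 2, 6, 2)):
    # g_a: four stages of patch merging + Swin Transformer blocks,
    # producing a 16x spatially downsampled latent.
    for c_i, d_i in zip(widths, depths):
        x = patch_merging(x, c_i)
        for _ in range(d_i):
            x = swin_block(x, dim=c_i)
    return x
```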
3.5 Transformer-based Slice Transform

In the realm of neural image compression, incorporating spatio-channel dependencies into entropy modeling is crucial. These dependencies, often termed spatial and channel or spatial-channel dependencies in prior literature [12, 22, 27], capture the intricate relationships among neighboring pixels within quantized latent features across different channels. Recognizing these spatio-channel dependencies in the quantized latent representation is essential for eliminating redundancy along the spatial and channel axes, ultimately enhancing compression efficiency while maintaining perceptual quality.

Aligned with this objective, we introduce the concept of separate entropy coding by partitioning the hyperprior latent representation into channel groups, as opposed to the conventional serially-decoded approach, to achieve more effective context modeling. Subsequently, we propose a tiny Transformer-based spatial context model that leverages cross-channel redundancies to generate highly informative adjacent contexts from the hyperprior latent slices, combined with the already-decoded latent slices. Consequently, our approach introduces a multidimensional entropy estimation model known as spatio-channel entropy modeling, which proves to be both fast and effective in reducing bitrate. According to this method, each latent element is conditioned on adjacent decoded elements that are spatio-channel neighbors, effectively eliminating redundancy along the spatial and channel axes.

Fig. 4 shows our spatio-channel entropy coding. We apply a tiny spatial context model g^i_cm to exploit the spatio-channel correlations per i-th group of hyperprior channels ŷ^i_ms, combined with the already-decoded latent slices {y_1, ..., y_s}, where {s ∈ ℕ | 1 ≤ s ≤ 5} stands for the number of supported slices. This process enhances the accuracy of entropy parameter estimation, thereby optimizing the overall efficiency of entropy coding. As a side effect, it also results in faster decoding, thanks to the parallelization capabilities of the Swin Transformer on the graphics processing unit (GPU) [36].

Fig. 4. Spatio-channel entropy coding. g_a, h_s, g^i_cm, AE, and AD stand for the analysis and hyper-synthesis transforms, the i-th context model, and the arithmetic encoder/decoder, respectively. {y_1, ..., y_s} stands for the already-decoded latent slices, where {s ∈ ℕ | 1 ≤ s ≤ 5} is the number of supported slices.

In contrast to ConvNets, which rely on convolution in place of general matrix multiplication and are susceptible to communication overhead when parallelized across multiple GPUs, Swin Transformers exhibit better parallelizability on GPUs. This is attributed to their hierarchical attention mechanism, which processes attention in a windowed manner, reducing global attention complexity and enabling efficient self-attention parallelization. Additionally, Swin Transformers leverage multi-head parallelism and tokenization strategies to maximize GPU utilization. These features make them a prime choice for a wide range of computer vision tasks that require GPU acceleration.

The tiny slice transform consists of two successive Swin Transformer blocks with an additional learnable linear projection layer, used to obtain a representative concatenation of latent slices. This ChARM estimates the distribution p_ŷ(ŷ | ẑ) with both the mean and standard deviation of each quantized latent slice and incorporates an autoregressive context model that conditions on the already-decoded latent slices to further reduce the spatial redundancy between adjacent pixels.
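A minimal sketch of the channel-wise autoregressive loop follows; `slice_transforms[i]` stands for the tiny two-block Swin slice transform described above (here simply an assumed callable returning (μ_i, σ_i)), and mean-conditioned rounding stands in for arithmetic coding:

```python
import tensorflow as tf

def charm_decode(hyper_feat, y_slices, slice_transforms):
    # Channel-wise autoregressive prior: (mu_i, sigma_i) for the i-th
    # latent slice are predicted from the hyperprior features and the
    # concatenation of all previously decoded slices.
    decoded = []
    for i, net in enumerate(slice_transforms):
        ctx = tf.concat([hyper_feat] + decoded, axis=-1)
        mu_i, sigma_i = net(ctx)
        # In a real decoder the arithmetic decoder reconstructs the
        # slice from (mu_i, sigma_i); here we reuse the encoder slices
        # and round the residual around the predicted mean.
        y_hat_i = tf.round(y_slices[i] - mu_i) + mu_i
        decoded.append(y_hat_i)
    return tf.concat(decoded, axis=-1)
```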
4 EXPERIMENTAL RESULTS

In this section, we first describe the experimental setup, including the datasets used, the baselines against which we compare, and the implementation details. Then, we assess the compression efficiency of our method with a rate-distortion comparison and compute the average bitrate savings on four commonly-used evaluation datasets. We further elaborate a model scaling study to consistently examine the effectiveness of our proposed method against pioneering ones. Additionally, we perform a resize parameter analysis to show the variations of the predicted parameter s. Finally, we conduct a latent analysis, an ablation study, and a qualitative analysis to highlight the impact of our architectural choices.

4.1 Experimental Setup

Datasets. The training set of the CLIC2020 dataset is used to train the proposed models. This dataset contains professional and user-generated content images in RGB color and grayscale formats. We evaluate image compression models on four datasets: Kodak [31], Tecnick [4], JPEG-AI [29], and the testing set of CLIC21 [1]. Fig. 5 gives the number of images by pixel count for the four test datasets. Finally, for a fair comparison, all images are cropped to the highest possible multiples of 256 to avoid padding for neural codecs.

Table 1. Architecture configuration.

IC | Filter size C_i               | Depth size d_i
   | C1   C2   C3   C4   C5   C6  | d0  d1  d2  d3  d4  d5  d6  d7
B1 | 320  320  320  320  192  192 | -   -   -   -   -   -   -   -
B2 | 128  192  256  320  192  192 | -   2   2   6   2   5   1   -
O1 | 128  192  256  320  192  192 | -   2   2   6   2   5   1   2
O2 | 128  192  256  320  192  192 | 3   2   2   6   2   5   1   2

Baselines. We compare our approach with the state-of-the-art neural compression methods SwinT-ChARM, proposed by Zhu et al. [68], and Conv-ChARM, proposed by Minnen et al. [46], as well as non-neural compression methods, including better portable graphics (BPG) (4:4:4) and the up-to-date VVC official Test Model VTM-18.0 in the All-Intra profile configuration. Table 1 gives the configuration of each of the considered image codec baselines, with B1 and B2 referring to Conv-ChARM and SwinT-ChARM, respectively, and O1 and O2 referring to our proposed approaches ICT and AICT, respectively. C_i and d_i are the hyperparameters defined in Fig. 2. We intensively compare our solutions with Conv-ChARM [46] and SwinT-ChARM [68], chosen from among the state-of-the-art models [17, 22, 30, 47, 67, 69], under the same training and testing conditions. Nevertheless, Fig. 9 compares our models with additional state-of-the-art solutions.

Implementation details. We implemented all models in TensorFlow using the tensorflow-compression (TFC) library [6], and the experimental study was carried out on an RTX 5000 Ti GPU. All models were trained on the same CLIC2020 training set for 2M steps using the ADAM optimizer with parameters β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 10^-4 and drops to 10^-5 for the last 200k iterations. The loss function, expressed in Equation (7), is a weighted combination of the bitrate R and distortion D, with λ being the Lagrangian multiplier steering the rate-distortion trade-off. MSE is used as the distortion metric in the RGB color space. Each training batch contains eight random 256×256×3 crops x_j from the CLIC2020 training set.
To cover a wide range of rate and distortion points, for our proposed method and the respective ablation models, we trained four models with λ ∈ {1000, 200, 20, 3} × 10^-5. The inference time experiments on the central processing unit (CPU) are performed on an Intel(R) Xeon(R) W-2145 processor running at 3.70 GHz.
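A sketch of this training schedule, assuming the model and loss definitions from the earlier snippets:

```python
import tensorflow as tf

# One model is trained per rate point.
lambdas = [1000e-5, 200e-5, 20e-5, 3e-5]

total_steps = 2_000_000
final_steps = 200_000

# 1e-4 for the first 1.8M steps, then 1e-5 for the last 200k.
lr = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[total_steps - final_steps], values=[1e-4, 1e-5])
optimizer = tf.keras.optimizers.Adam(lr, beta_1=0.9, beta_2=0.999)
```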
4.2 Rate-Distortion Performance

To demonstrate the compression efficiency of our proposed solutions, we plot the rate-distortion curves of ICT, AICT, and the baselines on the benchmark datasets. Fig. 6 (a) (1st row) gives the PSNR versus the bitrate for our solutions and baselines on the Kodak dataset. The figure shows that AICT and ICT equally outperform the neural approaches Conv-ChARM and SwinT-ChARM, as well as the BPG (4:4:4) and VTM-18.0 traditional codecs, achieving higher PSNR values across the different bitrate ranges. Moreover, we introduce Fig. 6 (a) (2nd row), showing the rate savings over VTM-18.0 on the Kodak dataset. The rate saving over VVC (%) represents the percentage reduction in bitrate achieved by a specific compression model compared to a reference codec while maintaining an equivalent level of image reconstruction quality, as measured by the peak signal-to-noise ratio (PSNR) in our context. This graph is a generalized version of a Bjøntegaard Delta (BD) chart [8], plotting rate savings as a function of quality instead of solely presenting average savings [46]. By comparing the performance of different models using this figure, we can discern which model excels in striking a balance between compression efficiency at different levels of reconstruction quality. Models that achieve higher bitrate savings over VVC at various PSNR levels are considered superior in terms of compression performance.

Fig. 5. Number of images per dataset per pixel count in megapixels (Mpx).

Table 2. BD-rate ↓ (PSNR) performance of BPG (4:4:4), Conv-ChARM, SwinT-ChARM, ICT, and AICT compared to VTM-18.0.

Image Codec  | Kodak  | Tecnick | JPEG-AI | CLIC21 | Average
BPG444       | 22.28% | 28.02%  | 28.37%  | 28.02% | 26.67%
Conv-ChARM   |  2.58% |  3.72%  |  9.66%  |  2.14% |  4.53%
SwinT-ChARM  | -1.92% | -2.50%  |  2.91%  | -3.22% | -1.18%
ICT (ours)   | -5.10% | -5.91%  | -1.14%  | -6.44% | -4.65%
AICT (ours)  | -5.09% | -5.99%  | -2.03%  | -7.33% | -5.11%

ICT and AICT achieve significant rate savings compared to the baselines, demonstrating their ability to compress images more efficiently. More specifically, AICT, which includes the adaptive resolution module, achieves the highest bitrate gain in the low bitrate/quality range, where it is more beneficial to reduce the spatial resolution. To further generalize the effectiveness of our solutions, we extend the evaluation to three high-resolution datasets (Tecnick, JPEG-AI, and CLIC21), as shown in Fig. 6. The figure illustrates PSNR versus bitrate (1st row) and rate savings (2nd row) on the considered datasets. AICT and ICT consistently achieve better rate-distortion performance and considerable rate savings compared to the existing traditional codecs and the neural codecs Conv-ChARM and SwinT-ChARM, demonstrating their efficiency across different high-resolution images and datasets.

Fig. 6. Comparison of compression efficiency on the Kodak, Tecnick, JPEG-AI, and CLIC21 datasets. Rate-distortion (PSNR vs. rate (bpp)) comparison and rate saving over VTM-18.0 (larger is better) are respectively illustrated for each benchmark dataset.

Table 3. BD-rate ↓ performance of SwinT-ChARM, ICT, and AICT compared to Conv-ChARM.

BD-rate (PSNR) ↓
Image Codec  | Kodak  | Tecnick | JPEG-AI | CLIC21 | Average
SwinT-ChARM  | -4.24% | -6.40%  | -6.13%  | -5.37% | -5.54%
ICT (ours)   | -7.30% | -9.52%  | -9.85%  | -8.47% | -8.79%
AICT (ours)  | -7.28% | -9.68%  | -10.20% | -9.35% | -9.13%

BD-rate (MS-SSIM) ↓
Image Codec  | Kodak  | Tecnick | JPEG-AI | CLIC21 | Average
SwinT-ChARM  | -6.34% | -7.01%  | -7.49%  | -6.30% | -6.79%
ICT (ours)   | -7.60% | -8.31%  | -9.29%  | -7.50% | -8.18%
AICT (ours)  | -7.58% | -8.31%  | -9.87%  | -7.67% | -8.36%

Furthermore, we assessed the effectiveness of our methods using the perceptual quality metric multi-scale structural similarity index (MS-SSIM) on the four benchmark datasets. To calculate MS-SSIM in decibels (dB), a logarithmic transformation is applied to the original MS-SSIM values as follows: MS-SSIM(dB) = -10 × log10(1 - MS-SSIM). This transformation is performed to provide a more intuitive and interpretable scale for comparing image quality, as previously done in several works [17, 22, 30, 63, 67]. Fig. 7 gives the MS-SSIM scores versus the bitrate for the four test datasets. As illustrated in Fig. 7, our methods yield better coding performance than the current neural baselines in terms of MS-SSIM. Note that we have not optimized our approaches and baselines using MS-SSIM as a differentiable distortion measure in the loss function during the training process. Thus, optimizing the solutions with MS-SSIM would further improve the performance regarding this metric.

Besides the rate-distortion and rate-savings curves, we also evaluate the different models using Bjøntegaard's metric [8], which computes the average bitrate savings (%) between two rate-distortion curves. In Table 2, we summarize the BD-rate (PSNR) of the image codecs across all four datasets, with VTM-18.0 as the anchor. On average, ICT and AICT achieve -4.65% and -5.11% rate reductions, respectively, compared to VTM-18.0, and -3.47% and -3.93% relative gains over SwinT-ChARM. In addition, Table 3 presents the BD-rate (PSNR / MS-SSIM) of SwinT-ChARM, ICT, and AICT across the considered datasets, compared with the anchor Conv-ChARM. Once again, ICT and AICT outperform the neural approach Conv-ChARM, with average rate reductions (PSNR) of -8.79% and -9.13% and average rate reductions (MS-SSIM) of -8.18% and -8.36%, respectively, outperforming the SwinT-ChARM solution on the four benchmark datasets and the two image quality metrics.
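For reference, a compact sketch of the two evaluation utilities used here: the MS-SSIM dB mapping above, and the Bjøntegaard delta rate, which fits log-rate as a cubic polynomial of quality and integrates the gap over the overlapping quality range (a common formulation of [8], not necessarily the paper's exact script):

```python
import numpy as np

def msssim_db(msssim):
    # MS-SSIM(dB) = -10 * log10(1 - MS-SSIM)
    return -10.0 * np.log10(1.0 - msssim)

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Fit log-rate vs. PSNR with cubic polynomials and average the
    # horizontal gap between the curves; negative = rate saving (%).
    la, lt = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    return (np.exp((it - ia) / (hi - lo)) - 1.0) * 100.0
```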
Overall, the proposed ICT and AICT have demonstrated strong rate-distortion performance on various benchmark datasets. This indicates that our approaches can better preserve image quality at lower bitrates, highlighting their potential for practical applications in image compression.

Fig. 7. Comparison of compression efficiency on the Kodak, Tecnick, JPEG-AI, and CLIC21 datasets. Rate-distortion (MS-SSIM vs. rate (bpp)) comparison is illustrated for each benchmark dataset.

4.3 Model Scaling Study

We evaluated the decoding complexity of the proposed and baseline neural codecs by averaging the decoding time across 7000 images at 256×256×3 resolution, encoded at varying bitrates, specifically {0.1, 0.8, 1.5} (bpp). Subsequently, we computed the average decoding time for this dataset, resulting in an overall average bitrate of 0.8 bpp. Table 4 gives the image codec complexity features, including the decoding time on GPU and CPU, the number of floating point operations (FLOPs), and the total model parameters. Finally, we recall that the models run with TensorFlow 2.8 on a workstation with one RTX 5000 Ti GPU.

Table 4. Image codec complexity. We calculated the average decoding latency across 7000 images at 256×256 resolution, encoded on average at 0.8 bpp. The best score is highlighted in bold.

Image Codec  | Latency (ms) ↓ GPU | Latency (ms) ↓ CPU | MFLOPs ↓ | #parameters (M) ↓
Conv-ChARM   | 133.8              | 359.8              | 126.1999 | 53.8769
SwinT-ChARM  | 91.8               | 430.7              | 63.2143  | 31.3299
ICT (ours)   | 80.1               | 477.0              | 74.7941  | 37.1324
AICT (ours)  | 88.3               | 493.3              | 74.9485  | 37.2304

Compared to the neural baselines, ICT achieves faster decoding on GPU but not on CPU, which demonstrates the parallel processing ability of the well-engineered transform and entropy coding designs to speed up decompression on GPU. This is potentially helpful for conducting high-quality real-time visual data streaming. Our AICT is on par with ICT in terms of the number of parameters, FLOPs, and latency, indicating the lightweight nature of the scale adaptation module, with consistent coding gains over the four datasets and two quality metrics.

Fig. 8 gives the BD-rate (with VTM-18.0 as the anchor) versus the FLOPs per pixel of ICT, AICT, SwinT-ChARM, and Conv-ChARM on the Kodak dataset. We can notice that ICT and AICT sit in an interesting area, achieving a good trade-off between the BD-rate score on Kodak, total model parameters, and FLOPs per pixel, reflecting an efficient and hardware-friendly compression model.

Fig. 8. Model size scaling. BD-rate (PSNR) on the Kodak dataset versus FLOPs per pixel for the proposed AICT and ICT (37.2M / 37.1M parameters) compared to Conv-ChARM (53.9M) and SwinT-ChARM (31.3M), for both encoding and decoding. Circle sizes indicate the numbers of parameters. Left-top is better.
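The latency numbers above were averaged over many decodes; a generic sketch of such a measurement follows (the decoder callable and bitstream list are assumptions, and accurate GPU timing would also require device synchronization between calls):

```python
import time

def avg_decode_latency_ms(decode_fn, bitstreams, warmup=5):
    # Warm up first (graph building, memory allocation), then average
    # wall-clock decode time over the whole set.
    for bs in bitstreams[:warmup]:
        decode_fn(bs)
    start = time.perf_counter()
    for bs in bitstreams:
        decode_fn(bs)
    return 1000.0 * (time.perf_counter() - start) / len(bitstreams)
```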
Finally, Fig. 9 shows the BD-rate (with VTM-18.0 as the anchor) versus the decoding time of various codecs on the Kodak dataset. It can be seen from the figure that our ICT and AICT achieve a good trade-off between BD-rate performance and decoding time. Furthermore, the symmetrical architecture of the proposed solutions allows similar complexity at both the encoder and decoder. This feature can be an advantage of neural codecs, since the best conventional codecs, like VVC, exhibit more complex encoding than decoding.

Fig. 9. BD-rate (PSNR) versus decoding time (ms) on the Kodak dataset, comparing VTM-18.0, Conv-ChARM, SwinT-ChARM, ICT, AICT, and prior methods (Ballé et al. 2018, Lee et al. 2019, Cheng et al. 2020, Wu et al. 2020, Guo et al. 2021, He et al. 2022). Star and diamond markers refer to decoding on GPU and CPU, respectively. Left-top is better.

4.4 Resize Parameter Analysis

We conduct a resize parameter analysis across the benchmark datasets, covering the images of the four datasets with various resolutions, as illustrated in Fig. 5. Fig. 10 shows how the parameter s varies according to the weighting parameter λ (i.e., the bitrate) for the four datasets. First, we can notice that the estimated resize parameter s depends on the bitrate and the spatial characteristics of the image content. Resizing the input image to a lower resolution is frequently observed at low bitrates, where the compression removes image details. In contrast, down-sampling is not performed at high bitrates in order to reach high image quality, particularly when the up-sampling module cannot recover the image details at the decoder. Nevertheless, even at high bitrates, a few samples are down-sampled to a lower resolution, especially images with low spatial information that the up-sampling module can easily recover on the decoder side. This also explains the higher coding gain brought by the adaptive sampling module of AICT on datasets that include more high-resolution images, such as JPEG-AI and CLIC21 (see Fig. 5).

To gain a deeper understanding of the observed performance variations in Tables 2 and 3, we direct our focus towards the characteristics of the test datasets. Fig. 5 illustrates the distribution of images by pixel count for the four test datasets, including Kodak, Tecnick, and two datasets (JPEG-AI and CLIC21) with higher-resolution images. It is noteworthy that, as depicted in Fig. 5, Kodak and Tecnick comprise images with a significantly lower total number of pixels. This attribute inherently impacts the effectiveness of certain modules, like adaptive scaling, given the correlation between the predicted resize factor and the total number of pixels in each image. Furthermore, the content complexity within these datasets may contribute to the observed limited impact of adaptive scaling. As demonstrated in Fig. 10, images with more high-frequency details might not benefit as significantly from such modules. In addition, skipping the resize modules for a predicted scale close to 1 (s ≈ 1) contributes to reducing the encoding and decoding complexity.

4.5 Latent Analysis

Transform coding is motivated by the idea that coding is more effective in the transform domain than in the original signal space.
A desirable transform would decorrelate the source signal so that simple scalar quantization and a factorized entropy model can be applied without constraining coding performance. Furthermore, an appropriate prior model would provide context adaptivity and utilize distant spatial relations in the latent tensor. The effectiveness of the analysis transform g_a can then be evaluated by measuring the level of correlation in the latent signal ŷ. We are particularly interested in measuring the correlation between nearby spatial positions, which are heavily correlated in the source domain for natural images. In Fig. 11, we visualize the normalized spatial correlation of ŷ averaged over all latent channels and compare Conv-ChARM and SwinT-ChARM with the proposed ICT at λ = 0.002. We can observe that, while all lead to small cross-correlations, ICT decorrelates the latent slightly better than SwinT-ChARM and considerably better than Conv-ChARM. This suggests that Transformer-based transforms with Transformer-based entropy modeling incur less redundancy across different spatial latent locations than convolutional ones, leading to an overall better rate-distortion trade-off. The sketch after the figure captions below illustrates how this correlation map is computed.

Fig. 10. Box plot of the predicted resize parameter s versus the weighting parameter λ ∈ {0.01, 0.002, 0.0002, 0.00003} (higher to lower bitrate), evaluated across the four considered datasets (Kodak, Tecnick, JPEG-AI, CLIC21). The '◦' symbol denotes outliers.

Fig. 11. The spatial correlation at index (i, j) corresponds to the normalized cross-correlation of the reconstructed latent (ŷ - μ)/σ at spatial locations (w_c, h_c) and (w_c + i, h_c + j), averaged across all latent channels of all image patches across the four considered datasets. We considered (a) Conv-ChARM (PSNR = 35.20 dB, BPP = 0.2776), (b) SwinT-ChARM (PSNR = 35.35 dB, BPP = 0.2555), and the proposed (c) ICT (PSNR = 35.48 dB, BPP = 0.2514), all trained at λ = 0.002.
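A minimal sketch of the Fig. 11 measurement, assuming a batch of standardized latents (ŷ - μ)/σ in NumPy; since the input is approximately zero-mean and unit-variance, the cross-correlation reduces to an average product, normalized here so the center equals 1:

```python
import numpy as np

def latent_spatial_correlation(u, max_shift=2):
    # u: standardized latents (y_hat - mu) / sigma, shape (N, H, W, C).
    # Returns the (2k+1, 2k+1) map of average products between the
    # center position and its spatial neighbors, over batch and channels.
    n, h, w, c = u.shape
    hc, wc = h // 2, w // 2
    center = u[:, hc, wc, :]                      # (N, C)
    corr = np.zeros((2 * max_shift + 1,) * 2)
    for dj in range(-max_shift, max_shift + 1):
        for di in range(-max_shift, max_shift + 1):
            neigh = u[:, hc + dj, wc + di, :]
            corr[dj + max_shift, di + max_shift] = np.mean(center * neigh)
    return corr / corr[max_shift, max_shift]      # center normalized to 1
```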
4.6 Ablation Study

To investigate the impact of the proposed ICT and AICT, we conduct an ablation study according to the reported BD-rate ↓ w.r.t. VVC and Conv-ChARM in Table 2 and Table 3, respectively. Image compression performance increases from Conv-ChARM to SwinT-ChARM on the considered datasets, thanks to the inter-layer feature propagation across non-overlapping windows (local information) and the self-attention mechanism (long-range dependencies) of the Swin Transformer. With the proposed spatio-channel entropy model, ICT achieves on average -3.47% (PSNR) and -1.39% (MS-SSIM) rate reductions compared to SwinT-ChARM. Moreover, AICT improves on ICT by an average of -0.46% (PSNR) and -0.18% (MS-SSIM) rate reductions, with consistent gains over the four datasets. This indicates that introducing a scale-adaptation module can further reduce spatial redundancies and alleviate coding artifacts, especially at low bitrates, for higher compression efficiency. More importantly, the adaptive resolution may also reduce the complexity of the encoder and decoder in terms of the number of operations per pixel, since fewer pixels are processed on average by the codec when the input image is downscaled to a lower resolution, i.e., 𝑠 < 1.

4.7 Qualitative Analysis

To assess the perceptual quality of the decoded images, we visualize two reconstructed samples obtained with the proposed ICT and AICT methods, along with Conv-ChARM and SwinT-ChARM, all trained at the same low-bitrate configuration (𝜆 = 0.002). Fig. 12 presents the reconstructed kodim14 and kodim04 images from the Kodak dataset.

Fig. 12. Visualization of reconstructed images from the Kodak dataset. The metrics are [bpp↓/PSNR (dB)↑/MS-SSIM (dB)↑]. kodim14: Conv-ChARM [0.4656/30.78/14.70], SwinT-ChARM [0.4421/30.80/14.87], ICT (ours) [0.4420/31.01/14.90], AICT (ours) [0.4414/30.93/14.88]. kodim04: Conv-ChARM [0.2278/33.03/13.91], SwinT-ChARM [0.2138/33.15/14.15], ICT (ours) [0.2133/33.28/14.23], AICT (ours) [0.2129/33.24/14.20].

Although not immediately apparent in Fig. 12, the proposed ICT and AICT manifest subtle improvements in specific image regions and content types, and a closer examination reveals discernible differences in certain areas of the reconstructed images. For instance, in kodim14, the intensity of black pixels within the text elements is noticeably enhanced in the proposed models, ICT and AICT, compared to the baselines. This signifies a finer level of detail preservation in the text, which can be particularly crucial in applications involving textual content. Moreover, in the last row of patches of the same image, the water soaking the green sweater worn by the person is rendered significantly better by the ICT model than by the other models. This improved depiction of intricate textures and finer details highlights the model's capability to faithfully capture and reproduce complex image features, which is essential in scenarios where preserving fine textures is critical for perceptual quality. In summary, under a similar rate budget, ICT and AICT maintain texture details and clean edges better, while suppressing visual artifacts, than the Conv-ChARM and SwinT-ChARM neural approaches. Additionally, the self-attention mechanism focuses more on high-contrast image regions and consequently achieves higher coding efficiency on such content.
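Note that the MS-SSIM values in Fig. 12 are reported on a decibel scale. Assuming the logarithmic mapping commonly used in the learned-compression literature (the paper does not restate the formula, so this is an assumption on our part), the conversion is straightforward:

```python
import math

def ms_ssim_db(ms_ssim: float) -> float:
    """Map a linear MS-SSIM score in [0, 1) to decibels.

    Uses -10 * log10(1 - MS-SSIM), a convention widely used in the
    neural image compression literature (an assumption here, since
    the paper does not restate the formula).
    """
    return -10.0 * math.log10(1.0 - ms_ssim)

# Example: a linear MS-SSIM of 0.9676 maps to roughly the magnitude
# of the values reported in Fig. 12.
print(round(ms_ssim_db(0.9676), 2))  # 14.89
```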
5 CONCLUSION

In this work, we have proposed AICT, a promising neural codec achieving compelling rate-distortion performance while significantly reducing decoding latency, which, with further optimizations, can potentially support high-quality real-time visual data transmission. We inherited the advantages of self-attention units from Transformers to approximate the mean and standard deviation for efficient entropy modeling, and combined global and local texture to capture correlations among spatially neighboring elements for nonlinear transform coding. Furthermore, we have presented a lightweight spatial resolution scale-adaptation module to enhance compression ability, especially at low bitrates. The experimental results, conducted on four datasets, showed that the ICT and AICT approaches outperform the state-of-the-art conventional codec VVC, achieving -4.65% and -5.11% BD-rate reductions, respectively, compared to VTM-18.0, averaged over the benchmark datasets. With the development of GPU and neural processing unit (NPU) chip technologies and further universal optimization frameworks [66], neural codecs will be the future of visual data coding, achieving better compression efficiency than traditional codecs and aiming to bridge the gap to real-time processing.

REFERENCES
[1] CLIC 2022. 2022. 5th Workshop and Challenge on Learned Image Compression. http://compression.cc/tasks/.
[2] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. 2018. Fixing a broken ELBO. In International Conference on Machine Learning. PMLR, 159–168.
[3] Hadi Amirpour, Antonio Pinheiro, Manuela Pereira, Fernando J. P. Lopes, and Mohammad Ghanbari. 2022. Efficient Light Field Image Compression with Enhanced Random Access. ACM Trans. Multimedia Comput. Commun. Appl. 18, 2, Article 44 (mar 2022), 18 pages. https://doi.org/10.1145/3471905
[4] Nicola Asuni and Andrea Giachetti. [n. d.]. TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms.
[5] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. 2017. End-to-end Optimized Image Compression. In International Conference on Learning Representations.
[6] Johannes Ballé, Sung Jin Hwang, and Eirikur Agustsson. 2022. TensorFlow Compression: Learned Data Compression. http://github.com/tensorflow/compression
[7] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. 2018. Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
[8] Gisle Bjontegaard. 2001. Calculation of average PSNR differences between RD-curves. VCEG-M33 (2001).
[9] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm. 2021. Overview of the Versatile Video Coding (VVC) Standard and its Applications. IEEE Transactions on Circuits and Systems for Video Technology 31, 10 (2021), 3736–3764. https://doi.org/10.1109/TCSVT.2021.3101953
[10] Ke-Hong Chen and Yih-Chuan Lin. 2023. Adaptive VQVAE: a learning-based image compression framework with vector quantization. In Proceedings of the 2023 11th International Conference on Computer and Communications Management (Nagoya, Japan) (ICCCM '23). Association for Computing Machinery, New York, NY, USA, 76–82. https://doi.org/10.1145/3617733.3617746
[11] Li-Heng Chen, Christos G Bampis, Zhi Li, Lukáš Krasula, and Alan C Bovik. 2022. Estimating the Resize Parameter in End-to-end Learned Image Compression. arXiv preprint arXiv:2204.12022 (2022).
[12] Tong Chen, Haojie Liu, Zhan Ma, Qiu Shen, Xun Cao, and Yao Wang. 2019. Neural image compression via non-local attention optimization and improved context modeling. arXiv preprint arXiv:1910.06244 (2019).
[13] Tong Chen, Haojie Liu, Zhan Ma, Qiu Shen, Xun Cao, and Yao Wang. 2021. End-to-end learnt image compression via non-local attention optimization and improved context modeling. IEEE Transactions on Image Processing 30 (2021), 3179–3191.
[14] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. 2020. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7939–7948.
[15] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. 2019. Variable rate deep image compression with a conditional autoencoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3146–3154.
[16] Emilien Dupont, Adam Golinski, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. 2021. COIN: COmpression with Implicit Neural representations. In Neural Compression: From Information Theory to Applications – Workshop @ ICLR 2021.
[17] Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Herve Jegou. 2023. Image Compression with Product Quantized Masked Image Modeling. Transactions on Machine Learning Research (2023).
[18] Brendan J Frey. 1997. Bayesian networks for pattern classification, data compression, and channel coding. Citeseer.
[19] M.J. Gormish, D. Lee, and M. W. Marcellin. 2000. JPEG 2000: overview, architecture, and applications. In Proceedings 2000 International Conference on Image Processing (Cat. No.00CH37101), Vol. 2. 29–32 vol.2. https://doi.org/10.1109/ICIP.2000.899217
[20] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. 2016. Towards conceptual compression. Advances in Neural Information Processing Systems 29 (2016).
[21] Zongyu Guo, Zhizheng Zhang, Runsen Feng, and Zhibo Chen. 2022. Causal Contextual Prediction for Learned Image Compression. IEEE Transactions on Circuits and Systems for Video Technology 32, 4 (2022), 2329–2341. https://doi.org/10.1109/TCSVT.2021.3089491
[22] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. 2022. ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5718–5727.
[23] Dailan He, Yaoyan Zheng, Baocheng Sun, Yan Wang, and Hongwei Qin. 2021. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14771–14780.
[24] Yueyu Hu, Wenhan Yang, and Jiaying Liu. 2020. Coarse-to-fine hyper-prior modeling for learned image compression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11013–11020.
[25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. Advances in Neural Information Processing Systems 28 (2015).
[26] Herve Jegou, Cordelia Schmid, Hedi Harzallah, and Jakob Verbeek. 2008. Accurate image search using the contextual dissimilarity measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1 (2008), 2–11.
[27] Wei Jiang, Jiayu Yang, Yongqi Zhai, Peirong Ning, Feng Gao, and Ronggang Wang. 2023. MLIC: Multi-Reference Entropy Model for Learned Image Compression. In Proceedings of the 31st ACM International Conference on Multimedia (Ottawa ON, Canada) (MM '23). Association for Computing Machinery, New York, NY, USA, 7618–7627. https://doi.org/10.1145/3581783.3611694
[28] Chen Jin, Ryutaro Tanno, Thomy Mertzanidou, Eleftheria Panagiotaki, and Daniel C. Alexander. 2022. Learning to Downsample for Segmentation of Ultra-High Resolution Images. In International Conference on Learning Representations.
[29] JPEG-AI. 2020. JPEG-AI Test Images. https://jpegai.github.io/test_images/.
[30] Jun-Hyuk Kim, Byeongho Heo, and Jong-Seok Lee. 2022. Joint Global and Local Hierarchical Priors for Learned Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5992–6001.
[31] Kodak. 1999. Kodak Test Images. http://r0k.us/graphics/kodak/.
[32] A Burakhan Koyuncu, Han Gao, Atanas Boev, Georgii Gaikov, Elena Alshina, and Eckehard Steinbach. 2022. Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIX. Springer, 447–463.
[33] Jooyoung Lee, Seunghyun Cho, Se Yoon Jeong, Hyoungjin Kwon, Hyunsuk Ko, Hui Yong Kim, and Jin Soo Choi. 2019. Extended End-to-End optimized Image Compression Method based on a Context-Adaptive Entropy Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[34] Mu Li, Kai Zhang, Jinxing Li, Wangmeng Zuo, Radu Timofte, and David Zhang. 2023. Learning Context-Based Nonlocal Entropy Modeling for Image Compression. IEEE Transactions on Neural Networks and Learning Systems 34, 3 (2023), 1132–1145. https://doi.org/10.1109/TNNLS.2021.3104974
[35] Mu Li, Wangmeng Zuo, Shuhang Gu, Jane You, and David Zhang. 2020. Learning content-weighted deep image compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 10 (2020), 3446–3461.
[36] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
[37] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11976–11986.
[38] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[39] Yue Lv, Jinxi Xiang, Jun Zhang, Wenming Yang, Xiao Han, and Wei Yang. 2023. Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression. In Proceedings of the 31st ACM International Conference on Multimedia (Ottawa ON, Canada) (MM '23). Association for Computing Machinery, New York, NY, USA, 632–642. https://doi.org/10.1145/3581783.3612187
[40] Haichuan Ma, Dong Liu, Ning Yan, Houqiang Li, and Feng Wu. 2022. End-to-End Optimized Versatile Image Compression With Wavelet-Like Transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 3 (2022), 1247–1263. https://doi.org/10.1109/TPAMI.2020.3026003
[41] Siwei Ma, Xinfeng Zhang, Chuanmin Jia, Zhenghui Zhao, Shiqi Wang, and Shanshe Wang. 2020. Image and Video Compression With Neural Networks: A Review. IEEE Transactions on Circuits and Systems for Video Technology 30, 6 (2020), 1683–1698. https://doi.org/10.1109/TCSVT.2019.2910119
[42] Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, and Yuri Boykov. 2019. Efficient segmentation: Learning downsampling near semantic boundaries. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2131–2141.
[43] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. 2018. Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4394–4402.
[44] Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. 2020. High-fidelity generative image compression. Advances in Neural Information Processing Systems 33 (2020), 11913–11924.
[45] David Minnen, Johannes Ballé, and George D Toderici. 2018. Joint autoregressive and hierarchical priors for learned image compression. Advances in Neural Information Processing Systems 31 (2018).
[46] David Minnen and Saurabh Singh. 2020. Channel-wise autoregressive entropy models for learned image compression. In 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 3339–3343.
[47] Matthew J Muckley, Alaaeldin El-Nouby, Karen Ullrich, Hervé Jégou, and Jakob Verbeek. 2023. Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models. arXiv preprint arXiv:2301.11189 (2023).
[48] Yuxiang Peng, Chong Fu, Guixing Cao, Wei Song, Junxin Chen, and Chiu-Wing Sham. 2024. JPEG-compatible Joint Image Compression and Encryption Algorithm with File Size Preservation. ACM Trans. Multimedia Comput. Commun. Appl. 20, 4, Article 105 (jan 2024), 20 pages. https://doi.org/10.1145/3633459
[49] Yichen Qian, Xiuyu Sun, Ming Lin, Zhiyu Tan, and Rong Jin. 2022. Entroformer: A Transformer-based Entropy Model for Learned Image Compression. In International Conference on Learning Representations.
[50] Yichen Qian, Zhiyu Tan, Xiuyu Sun, Ming Lin, Dongyang Li, Zhenhong Sun, Li Hao, and Rong Jin. 2021. Learning Accurate Entropy Model with Global Reference for Image Compression. In International Conference on Learning Representations.
[51] Adria Recasens, Petr Kellnhofer, Simon Stent, Wojciech Matusik, and Antonio Torralba. 2018. Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV). 51–66.
[52] M. M. Reid, R. J. Millar, and N. D. Black. 1997. Second-generation image coding: an overview. ACM Comput. Surv. 29, 1 (mar 1997), 3–29. https://doi.org/10.1145/248621.248622
[53] Yannick Strümpler, Janis Postels, Ren Yang, Luc Van Gool, and Federico Tombari. 2022. Implicit neural representations for image compression. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI. Springer, 74–91.
[54] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1649–1668. https://doi.org/10.1109/TCSVT.2012.2221191
[55] Hossein Talebi and Peyman Milanfar. 2021. Learning to resize images for computer vision tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 497–506.
[56] Zhisen Tang, Hanli Wang, Xiaokai Yi, Yun Zhang, Sam Kwong, and C.-C. Jay Kuo. 2023. Joint Graph Attention and Asymmetric Convolutional Neural Network for Deep Image Compression. IEEE Transactions on Circuits and Systems for Video Technology 33, 1 (2023), 421–433. https://doi.org/10.1109/TCSVT.2022.3199472
[57] Tao Tian, Hanli Wang, Sam Kwong, and C.-C. Jay Kuo. 2021. Perceptual Image Compression with Block-Level Just Noticeable Difference Prediction. ACM Trans. Multimedia Comput. Commun. Appl. 16, 4, Article 126 (jan 2021), 15 pages. https://doi.org/10.1145/3408320
[58] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. 2016. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems 29 (2016).
[59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[60] G.K. Wallace. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38, 1 (1992), xviii–xxxiv. https://doi.org/10.1109/30.125072
[61] Lilong Wang, Yunhui Shi, Jin Wang, Shujun Chen, Baocai Yin, and Nam Ling. 2024. Graph Based Cross-Channel Transform for Color Image Compression. ACM Trans. Multimedia Comput. Commun. Appl. 20, 4, Article 102 (jan 2024), 25 pages. https://doi.org/10.1145/3631710
[62] Yaojun Wu, Xin Li, Zhizheng Zhang, Xin Jin, and Zhibo Chen. 2022. Learned Block-Based Hybrid Image Compression. IEEE Transactions on Circuits and Systems for Video Technology 32, 6 (2022), 3978–3990. https://doi.org/10.1109/TCSVT.2021.3119660
[63] Yueqi Xie, Ka Leong Cheng, and Qifeng Chen. 2021. Enhanced Invertible Encoding for Learned Image Compression. In Proceedings of the 29th ACM International Conference on Multimedia (Virtual Event, China) (MM '21). Association for Computing Machinery, New York, NY, USA, 162–170. https://doi.org/10.1145/3474085.3475213
[64] Naifu Xue and Yuan Zhang. 2024. Lambda-Domain Rate Control for Neural Image Compression. In Proceedings of the 5th ACM International Conference on Multimedia in Asia (Tainan, Taiwan) (MMAsia '23). Association for Computing Machinery, New York, NY, USA, Article 3, 7 pages. https://doi.org/10.1145/3595916.3626372
[65] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. 2019. Residual Non-local Attention Networks for Image Restoration. In International Conference on Learning Representations.
[66] Jing Zhao, Bin Li, Jiahao Li, Ruiqin Xiong, and Yan Lu. 2023. A Universal Optimization Framework for Learning-based Image Codec. ACM Trans. Multimedia Comput. Commun. Appl. 20, 1, Article 16 (aug 2023), 19 pages. https://doi.org/10.1145/3580499
[67] Xiaosu Zhu, Jingkuan Song, Lianli Gao, Feng Zheng, and Heng Tao Shen. 2022. Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17612–17621.
[68] Yinhao Zhu, Yang Yang, and Taco Cohen. 2022. Transformer-based Transform Coding. In International Conference on Learning Representations.
[69] Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. 2022. The Devil Is in the Details: Window-based Attention for Image Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17492–17501.

Received 15 January 2023