HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution
Recent advances in implicit neural representations (INRs) have shown significant promise in modeling visual signals for various low-level vision tasks, including image super-resolution (ISR). INR-based ISR methods typically learn continuous representations, providing flexibility for generating high-resolution images at any desired scale from their low-resolution counterparts. However, existing INR-based ISR methods use multi-layer perceptrons for parameterization; this does not take into account the hierarchical structure present among local sampling points and hence constrains the representation capability. In this paper, we propose a new \textbf{H}ierarchical encoding based \textbf{I}mplicit \textbf{I}mage \textbf{F}unction for continuous image super-resolution, \textbf{HIIF}, which leverages a novel hierarchical positional encoding that enhances the local implicit representation, enabling it to capture fine details at multiple scales. Our approach also embeds a multi-head linear attention mechanism within the implicit attention network to take additional non-local information into account. Our experiments show that, when integrated with different backbone encoders, HIIF outperforms state-of-the-art continuous image super-resolution methods by up to 0.17dB in PSNR. The source code of HIIF will be made publicly available at \url{www.github.com}.
💡 Research Summary
The paper introduces HIIF (Hierarchical Encoding based Implicit Image Function), a novel framework for continuous image super‑resolution that addresses the limitations of existing implicit neural representation (INR) methods such as LIIF, LTE, CiaoSR, and CLIT. Traditional INR‑based super‑resolution models map 2‑D coordinates to RGB values using multilayer perceptrons (MLPs) and a single‑scale positional encoding. This approach neglects the hierarchical relationships among neighboring sampling points and lacks a mechanism to capture long‑range dependencies efficiently.
HIIF tackles these issues with two key innovations. First, it employs multi‑scale hierarchical positional encoding. For each query coordinate, the method computes a set of encodings at several resolution levels (l = 0 … L‑1). At each level the local coordinate is scaled by a factor S, discretized, and embedded as δ_h(x_q, l). By feeding these encodings sequentially into the network, neighboring points share intermediate features at coarser levels while finer levels provide high‑frequency detail, effectively creating a multi‑frequency representation within a single decoder.
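The scale-then-discretize idea behind δ_h(x_q, l) can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: the scale factor, the floor-based discretization, and the sin/cos embedding of the per-level offset are all assumptions chosen to make the multi-scale behavior concrete.

```python
import numpy as np

def hierarchical_encoding(x_q, num_levels=3, scale=2.0):
    """Sketch of a multi-scale positional encoding (hypothetical variant).

    x_q : (N, 2) array of local coordinates in [-1, 1], relative to the
          nearest latent code (as in LIIF-style local ensembles).
    Returns a list of per-level encodings delta_h(x_q, l), l = 0..L-1.
    """
    encodings = []
    for l in range(num_levels):
        # Scale the local coordinate by S^l: at coarse levels neighboring
        # points fall in the same cell and share features; finer levels
        # separate them, supplying high-frequency detail.
        scaled = x_q * (scale ** l)
        # Discretize: keep only the fractional offset within the level-l cell.
        offset = scaled - np.floor(scaled)        # in [0, 1)
        # Embed the offset (sin/cos is one simple embedding choice).
        enc = np.concatenate([np.sin(np.pi * offset),
                              np.cos(np.pi * offset)], axis=-1)
        encodings.append(enc)
    return encodings
```

Feeding these per-level encodings into the decoder one level at a time yields the multi-frequency representation described above.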
Second, HIIF integrates a multi‑head linear attention (MHA) module into the decoder. Conventional self‑attention is quadratic in the number of sampled points and thus impractical for dense image grids. Linear attention projects keys and values into a lower‑dimensional space, reducing complexity while preserving the ability of multiple heads to attend to different sub‑spaces of the feature map. This expands the receptive field and enables the model to incorporate non‑local information without prohibitive computational cost.
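The linear-complexity trick can be made concrete with a minimal multi-head linear attention sketch. This uses the kernel feature map φ(x) = elu(x) + 1 (one common choice for linear attention); the exact variant and projection dimensions used in HIIF may differ.

```python
import numpy as np

def linear_attention(q, k, v, num_heads=4, eps=1e-6):
    """Multi-head linear attention sketch.

    q, k, v : (N, D) arrays; D must be divisible by num_heads.
    Cost is O(N * d^2) per head rather than O(N^2 * d), because the
    (d, d) matrix K^T V is computed once and reused for every query.
    """
    N, D = q.shape
    d = D // num_heads
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0

    out = np.empty_like(q)
    for h in range(num_heads):
        sl = slice(h * d, (h + 1) * d)
        qh, kh, vh = phi(q[:, sl]), phi(k[:, sl]), v[:, sl]
        # Associativity: (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V),
        # so we never materialize the N x N attention matrix.
        kv = kh.T @ vh                  # (d, d)
        z = kh.sum(axis=0)              # (d,) normalizer
        out[:, sl] = (qh @ kv) / (qh @ z + eps)[:, None]
    return out
```

Each head attends within its own d-dimensional sub-space, which is what preserves the "different heads, different sub-spaces" behavior at linear cost.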
The overall architecture consists of: (1) an encoder E_φ that extracts a latent feature map z from the low‑resolution input using any standard ISR backbone (EDSR, RDN, SwinIR). No down‑sampling is performed, so the spatial size of z matches the LR image. (2) a decoder D_ρ that, for each high‑resolution coordinate, selects the four nearest latent codes, concatenates them with the hierarchical encodings and cell size, passes them through an MLP, and then through the MHA block. The output of each hierarchical level is fed into the next level, progressively refining the representation. Finally, a skip connection adds a bilinearly up‑sampled LR image to the decoder output, yielding the final HR result.
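One level of the decoder's query path can be sketched schematically. All names and shapes here are illustrative assumptions: the sketch gathers a single nearest latent code (the paper uses the four nearest), concatenates it with the hierarchical encoding and cell size, and applies a toy two-layer MLP; the attention block, the level-to-level feeding, and the bilinear skip connection are omitted for brevity.

```python
import numpy as np

def decode_query(z, coords, cell, hier_enc, W1, b1, W2, b2):
    """Schematic single-level decoder step (hypothetical shapes).

    z        : (H, W, C) latent feature map from the encoder (LR resolution).
    coords   : (N, 2) query coordinates in [0, 1)^2.
    cell     : (N, 2) per-query cell sizes (the HR pixel extent).
    hier_enc : (N, E) hierarchical positional encoding for this level.
    W1, b1, W2, b2 : toy MLP weights.
    """
    H, W, C = z.shape
    # Nearest latent code per query (a single neighbour keeps the sketch
    # short; HIIF gathers the four nearest codes).
    iy = np.clip((coords[:, 0] * H).astype(int), 0, H - 1)
    ix = np.clip((coords[:, 1] * W).astype(int), 0, W - 1)
    latent = z[iy, ix]                                  # (N, C)
    # Concatenate latent code, hierarchical encoding, and cell size.
    feat = np.concatenate([latent, hier_enc, cell], axis=-1)
    # Two-layer MLP with ReLU; in the full model this output would pass
    # through the attention block and feed the next hierarchical level,
    # with a bilinearly up-sampled LR image added at the end.
    hidden = np.maximum(feat @ W1 + b1, 0.0)
    return hidden @ W2 + b2
```

The per-query gather is what makes the representation continuous: any coordinate in [0, 1)^2 can be decoded, independent of the target resolution.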
Experiments were conducted on DIV2K validation and Set5, covering a wide range of scaling factors (×2, ×3, ×4, ×6, ×12, ×18, ×24, ×30) and both in‑distribution and out‑of‑distribution scenarios. Quantitative results (Table 1) show that adding HIIF to each backbone consistently improves PSNR by 0.07–0.17 dB over the strongest baselines (LIIF, LTE, CiaoSR, CLIT, SRNO). The gains are especially pronounced at large up‑sampling factors (e.g., ×12, ×24) where high‑frequency detail is hardest to recover. Visual comparisons demonstrate sharper edges and more faithful texture reconstruction.
Key contributions of the work are:
- Hierarchical positional encoding for continuous super‑resolution, the first use of multi‑scale encodings to model local neighborhoods in this setting.
- A multi‑scale decoder architecture that concatenates hierarchical encodings with latent features, allowing the network to learn scale‑specific sub‑bands.
- Multi‑head linear attention within the implicit function, enabling efficient non‑local context aggregation, another first for ISR.
The authors also discuss limitations: increasing the number of hierarchical levels and attention heads raises memory and compute demands, which may hinder real‑time deployment. Future directions include adaptive level selection, hardware‑friendly attention variants, extension to video super‑resolution, and joint optimization for compression‑aware super‑resolution.
In summary, HIIF provides a flexible, plug‑and‑play module that can be attached to any existing ISR encoder, delivering superior continuous super‑resolution performance through explicit multi‑scale positional modeling and efficient global attention.