Thinking inside the Convolution for Image Inpainting: Reconstructing Texture via Structure under Global and Local Side
Image inpainting has achieved substantial progress owing to the encoder-decoder pipeline, which benefits from Convolutional Neural Networks (CNNs): convolutional downsampling in the encoder inpaints the masked regions semantically from the known regions, coupled with an upsampling process in the decoder that produces the final output. Recent studies identify the high-frequency structure and low-frequency texture extracted by the CNN encoder as the cues for a desirable upsampling recovery. However, existing works inevitably overlook the information loss in both the structure and texture feature maps during convolutional downsampling, and hence suffer from non-ideal upsampling output. In this paper, we systematically answer whether and how the structure and texture feature maps can mutually help to alleviate this information loss. Given the structure and texture feature maps, we adopt a statistical normalization and denormalization strategy to guide their reconstruction during convolutional downsampling. Extensive experimental results validate its advantages over the state of the art on images from low to high resolutions, including 256×256 and 512×512; the gains especially hold when substituting all encoders with ours. Our code is available at https://github.com/htyjers/ConvInpaint-TSGL
💡 Research Summary
The paper addresses a fundamental yet under‑explored problem in modern encoder‑decoder image inpainting: the loss of both structural (high‑frequency) and texture (low‑frequency) feature maps during the convolutional down‑sampling stages of the encoder. While many recent works focus on extracting these two complementary cues and then fusing or guiding them in the decoder, they largely ignore that the down‑sampling itself discards valuable information, especially the fine‑grained texture details that are crucial for high‑quality reconstruction.
Key Contributions
- Structure‑to‑Texture Reconstruction via Normalization/Denormalization – The authors propose to use the sparse structural feature map as a statistical guide for reconstructing the dense texture feature map. By applying spatially‑adaptive normalization to the texture map and then denormalizing it with scale and shift parameters derived from the structure map, they effectively inject structural statistics back into the texture representation after each down‑sampling layer. This reverses the typical “structure‑only” guidance and instead lets structure restore texture.
- Global and Local Normalization Strategies – Two complementary normalization schemes are introduced:
- Global normalization computes mean and variance over the entire texture map (across all channels) and is paired with a global structural map that captures overall layout and large‑scale edges.
- Local normalization computes statistics per channel at each spatial location, highlighting fine‑grained texture; it is paired with a local residual structural map that encodes the remaining high‑frequency details after the global structure has been removed.
Experiments show that the combination “global texture (global norm) reconstructed from global structure” and “local texture (local norm) reconstructed from local residual structure” yields the best performance.
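The two normalization granularities above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the toy feature map, shapes, and function names are hypothetical, and "local" statistics are taken per channel, matching the per‑channel description in the summary.

```python
import numpy as np

def global_stats(t):
    """Mean/variance over the entire feature map (all channels and positions)."""
    return t.mean(), t.var()

def local_stats(t):
    """Per-channel mean/variance over spatial positions; t has shape (C, H, W)."""
    return t.mean(axis=(1, 2), keepdims=True), t.var(axis=(1, 2), keepdims=True)

# Toy texture feature map: 4 channels of 8x8 features (hypothetical sizes).
rng = np.random.default_rng(0)
T = rng.normal(size=(4, 8, 8))

g_mu, g_var = global_stats(T)
l_mu, l_var = local_stats(T)

T_global = (T - g_mu) / np.sqrt(g_var + 1e-5)  # one statistic for the whole map
T_local = (T - l_mu) / np.sqrt(l_var + 1e-5)   # one statistic per channel
```

The global variant preserves relative contrast between channels (large‑scale layout), while the per‑channel variant equalizes each channel, emphasizing fine‑grained texture variation within it.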
- Cross‑Layer Balance Module – Observing that the relative importance of global versus local structure changes across encoder depth (global dominates early, local later), the authors design a dynamic weighting module that balances the two reconstruction streams. Feature equalization is applied to harmonize the contributions before feeding the fused texture back to the next convolutional layer.
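A minimal sketch of such per‑layer balancing, assuming the weights are two learnable scalars passed through a softmax (the exact parameterization in the paper may differ; `logits` and `balance` are hypothetical names):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def balance(global_feat, local_feat, logits):
    """Fuse the two reconstruction streams with per-layer scalar weights.

    `logits` stands in for a layer's learnable parameters; the softmax keeps
    the two weights positive and summing to one.
    """
    w = softmax(logits)
    return w[0] * global_feat + w[1] * local_feat

# Toy streams: an early layer's logits favor the global stream, a deep
# layer's favor the local one, shifting emphasis with depth.
f_global = np.ones((4, 8, 8))
f_local = np.zeros((4, 8, 8))
early = balance(f_global, f_local, np.array([2.0, -2.0]))  # mostly global
deep = balance(f_global, f_local, np.array([-2.0, 2.0]))   # mostly local
```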
- Comprehensive Evaluation – The method is tested on several benchmarks (CelebA‑HQ, Places2, ImageNet subsets) at both 256×256 and 512×512 resolutions. Quantitatively, it surpasses state‑of‑the‑art methods (LaMa, CTSDG, ZITS, etc.) in PSNR (+0.5‑1.2 dB), SSIM (+0.02‑0.04), and LPIPS (≈10 % reduction). Qualitatively, the results exhibit sharper edges, more coherent textures, and fewer repetitive artifacts, especially in challenging regions such as facial features or intricate architectural details.
- Generality – The authors replace the encoder of all compared methods with their proposed structure‑texture reconstruction encoder and observe consistent gains, demonstrating that the approach is not tied to a specific backbone but can serve as a drop‑in improvement for any encoder‑decoder inpainting system.
Technical Details
- Structure extraction uses a Canny edge detector and grayscale conversion, combined with the binary mask, fed into a partial convolution layer that only processes known pixels.
- Texture extraction follows standard convolutional layers.
- Normalization/Denormalization follows a SPADE‑like formulation: \( \hat{T} = \gamma_S \frac{T - \mu_T}{\sigma_T} + \beta_S \), where \( \gamma_S, \beta_S \) are derived from the structure map (global or residual).
- The two reconstructed texture maps are summed element‑wise, then passed through the next down‑sampling convolution.
- The cross‑layer balance module learns per‑layer scalar weights for the global and local streams, ensuring a smooth transition of emphasis from global to local cues as depth increases.
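The SPADE‑like step above can be sketched end to end in NumPy. This is a toy illustration, not the paper's code: the affine maps `gamma_S = 1 + 0.1*S` and `beta_S = 0.1*S` are hypothetical stand‑ins for the learned projections of the structure map.

```python
import numpy as np

def spade_like_denorm(T, S, eps=1e-5):
    """Normalize texture map T with its own statistics, then denormalize
    with a scale/shift derived from structure map S.

    In the paper these scale/shift terms come from learned layers; here
    simple affine functions of S stand in for them.
    """
    mu, var = T.mean(), T.var()
    T_hat = (T - mu) / np.sqrt(var + eps)  # (T - mu_T) / sigma_T
    gamma_S = 1.0 + 0.1 * S                # stand-in for learned gamma_S(S)
    beta_S = 0.1 * S                       # stand-in for learned beta_S(S)
    return gamma_S * T_hat + beta_S

rng = np.random.default_rng(1)
T = rng.normal(size=(4, 8, 8))            # texture feature map (toy shape)
S_global = rng.normal(size=(4, 8, 8))     # global structure map
S_local = rng.normal(size=(4, 8, 8))      # local residual structure map

# Reconstruct from both cues, then sum element-wise before the next
# down-sampling convolution, as described above.
T_rec = spade_like_denorm(T, S_global) + spade_like_denorm(T, S_local)
```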
Strengths
- Directly tackles the often‑ignored down‑sampling information loss, providing a mathematically grounded remedy.
- The dual‑normalization scheme elegantly separates global layout from local detail, mirroring human visual processing.
- Dynamic balancing across layers adds adaptability that static fusion methods lack.
- Demonstrated compatibility with multiple existing backbones, indicating broad applicability.
Weaknesses and Open Issues
- The added normalization/denormalization and cross‑layer modules increase computational cost and memory footprint, which may hinder deployment on resource‑constrained devices.
- Reliance on edge detection (Canny) and grayscale conversion for structure may limit performance on images where color gradients convey essential structural cues.
- The statistical parameters are computed deterministically (mean/variance) rather than learned; a learnable attention‑based statistic could potentially yield further gains.
- The paper does not provide an ablation on the impact of mask complexity (e.g., irregular, large holes) on the reconstruction quality.
Future Directions
- Lightweight variants: Approximate the normalization operations or share parameters across layers to reduce overhead.
- Learnable statistics: Replace fixed mean/variance with small networks that predict adaptive statistics conditioned on the input.
- Color‑aware structure extraction: Incorporate learned edge detectors or multi‑scale gradient features that capture chromatic structure.
- Extending to video inpainting: The same reconstruction principle could be applied temporally, using motion‑consistent structure cues to restore texture across frames.
In summary, the paper presents a novel perspective on encoder‑side processing for image inpainting, showing that preserving and reconstructing texture via structure‑guided normalization can substantially mitigate the information loss inherent to convolutional down‑sampling. The method delivers state‑of‑the‑art quantitative and qualitative results across resolutions and demonstrates versatility as a plug‑and‑play enhancement for existing inpainting pipelines.