AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation
The dynamics of glacier and ice-shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet introduced the first pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model's ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch that captures long-range dependencies and provides global contextual information over a larger view, and a CNN-based target branch that preserves local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module that fosters interactions between the hybrid CNN-Transformer branches by dynamically adjusting token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision scheme that optimizes our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2% and an HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the jagged-edge issue typically seen in pure Transformer-based approaches.
💡 Research Summary
The paper introduces AMD‑HookNet++, a novel hybrid CNN‑Transformer architecture designed for accurate segmentation of glacier calving fronts in synthetic aperture radar (SAR) imagery. Building on the earlier AMD‑HookNet, which employed two parallel U‑Net branches (a low‑resolution context branch and a high‑resolution target branch), the authors replace the context branch with a Vision Transformer (specifically a Swin‑Transformer) to capture long‑range dependencies, while retaining a conventional CNN encoder‑decoder for the target branch to preserve fine‑grained spatial details.
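The fusion idea the two branches inherit from the HookNet lineage can be sketched as follows: the context branch covers a wider area at coarse resolution, so its centre crop spans the same physical extent as the target patch and can be upsampled and fused channel-wise. A minimal NumPy illustration (the shapes, nearest-neighbour upsampling, and plain concatenation are illustrative choices, not the paper's implementation):

```python
import numpy as np

def hook_context_to_target(context_feat, target_feat, scale=2):
    """Crop the centre of the low-resolution context features so it covers
    the same physical extent as the target patch, upsample it (nearest
    neighbour, for simplicity), and concatenate along the channel axis.
    Both feature maps use (C, H, W) layout."""
    c, h, w = context_feat.shape
    ch, cw = h // scale, w // scale                   # centre-crop size
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = context_feat[:, top:top + ch, left:left + cw]
    up = crop.repeat(scale, axis=1).repeat(scale, axis=2)  # back to target size
    return np.concatenate([up, target_feat], axis=0)

ctx = np.random.rand(8, 32, 32)   # context branch: wider view, coarse resolution
tgt = np.random.rand(8, 32, 32)   # target branch: narrow view, fine resolution
fused = hook_context_to_target(ctx, tgt)
print(fused.shape)  # (16, 32, 32)
```

In AMD-HookNet and its successor the fusion is attention-guided rather than a plain concatenation, but the geometric alignment step is the same.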
A central contribution is the Enhanced Spatial‑Channel Attention (ESCA) module, which fuses the heterogeneous feature maps from the two branches. ESCA first applies channel‑wise attention to highlight the most informative channels, then performs spatial attention to re‑weight token relationships across the image plane. This dual‑attention mechanism enables the global context learned by the Transformer to be effectively combined with the local detail retained by the CNN, resulting in richer, more discriminative representations.
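The channel-then-spatial re-weighting order can be illustrated with a parameter-free sketch. The actual ESCA module learns its attention weights from the fused features; the pooling-plus-sigmoid gates below only show the order of operations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_channel_attention(feat):
    """Channel attention followed by spatial attention on a (C, H, W) map.
    Gates here are parameter-free poolings; ESCA learns them, so this only
    demonstrates the dual re-weighting, not the module itself."""
    # Channel attention: global average pool -> one gate per channel.
    chan_desc = feat.mean(axis=(1, 2))                 # (C,)
    feat = feat * sigmoid(chan_desc)[:, None, None]    # broadcast over H, W
    # Spatial attention: pool over channels -> one gate per pixel.
    spat_desc = feat.mean(axis=0, keepdims=True)       # (1, H, W)
    return feat * sigmoid(spat_desc)                   # broadcast over C
```

Applying channel attention first lets the spatial stage operate on already channel-reweighted tokens, which is the interaction order described above.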
To further improve boundary precision, the authors introduce a pixel‑to‑pixel contrastive deep supervision strategy. Hierarchical pyramid features are projected into a pixel embedding space where a contrastive loss pulls together embeddings of the same class and pushes apart those of different classes. This loss is added to the standard cross‑entropy loss, guiding the network to learn class‑discriminative embeddings at multiple scales and reducing the “jagged” artifacts commonly observed in pure Transformer‑based segmenters.
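The pull-together/push-apart objective can be sketched as a supervised InfoNCE loss over pixel embeddings. The paper's pixel sampling, projection heads, and per-pyramid-level application are omitted; this is only the core loss under those simplifications:

```python
import numpy as np

def pixel_contrastive_loss(emb, labels, temperature=0.1):
    """Supervised InfoNCE over pixel embeddings: same-class pixels are
    positives, different-class pixels are negatives.
    emb: (N, D) pixel embeddings; labels: (N,) integer class ids."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalise
    sim = emb @ emb.T / temperature                          # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                           # exclude self-pairs
    m = sim.max(axis=1, keepdims=True)                       # for stable softmax
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos, False)
    has_pos = pos.sum(axis=1) > 0                            # anchors with a positive
    mean_log_prob = (np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos]
                     / pos.sum(axis=1)[has_pos])
    return -mean_log_prob.mean()
```

Minimising this loss drives embeddings of the same class together and those of different classes apart, which is the metric-learning signal added on top of the cross-entropy term.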
The method is evaluated on the CaFFe benchmark, a challenging dataset of 681 SAR images with manually annotated calving fronts. AMD‑HookNet++ achieves an Intersection‑over‑Union (IoU) of 78.2 %, surpassing the original AMD‑HookNet (≈69.7 %) and the state‑of‑the‑art HookFormer (≈75.5 %). It also records a 95th‑percentile Hausdorff distance (HD95) of 1,318 m and a Mean Distance Error (MDE) of 367 m, matching or exceeding the performance of HookFormer while delivering smoother front delineations. Qualitative visualizations demonstrate that the proposed model eliminates the “toothed” edges typical of Vision‑Transformer outputs, producing continuous, physically plausible front lines even in noisy SAR conditions.
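The three reported metrics can be sketched as follows. These are common definitions (symmetric nearest-neighbour distances for HD95 and MDE); the benchmark's official evaluation code may differ in details such as front-line extraction:

```python
import numpy as np

def iou(pred, gt):
    """Binary Intersection-over-Union between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def _nn_dists(a, b):
    """Nearest-neighbour distances between (N,2) and (M,2) point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.concatenate([d.min(axis=1), d.min(axis=0)])  # both directions

def hd95(pred_pts, gt_pts):
    """95th-percentile symmetric Hausdorff distance between front polylines."""
    return np.percentile(_nn_dists(pred_pts, gt_pts), 95)

def mde(pred_pts, gt_pts):
    """Mean distance error: symmetric mean nearest-neighbour distance."""
    return _nn_dists(pred_pts, gt_pts).mean()
```

IoU scores region overlap, while HD95 and MDE score the delineated front line itself, which is why the paper reports all three.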
Key contributions are: (1) a two‑branch hybrid architecture that assigns distinct roles—global context to the Transformer and local detail to the CNN—thereby mitigating the weaknesses of each individual paradigm; (2) the ESCA module for efficient spatial‑channel fusion of heterogeneous features; (3) a contrastive deep supervision scheme that embeds pixel‑level metric learning into the segmentation loss; and (4) the inclusion of HD95 alongside IoU and MDE to provide a more comprehensive assessment of front‑line accuracy.
The authors acknowledge limitations such as the computational overhead of the Transformer branch and the relatively small size of the CaFFe dataset, which may affect generalization. Future work is proposed to explore lightweight Transformer variants, multi‑sensor fusion (e.g., optical‑SAR), and model compression techniques to enable real‑time operational monitoring of glacier dynamics.