A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation
Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model updates during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA.
💡 Research Summary
The paper introduces A3‑TTA, a novel test‑time adaptation (TTA) framework for semantic segmentation that operates without any source data and updates the model online in a single‑pass fashion. Existing TTA methods fall into three categories: batch‑normalization (BN)‑based, entropy‑minimization‑based, and pseudo‑label‑based approaches. BN‑based methods merely replace normalization statistics and fail under large domain shifts; entropy minimization provides task‑agnostic supervision that can cause catastrophic forgetting; pseudo‑label methods rely on stochastic perturbations (dropout, augmentations, Gaussian noise), which generate noisy labels and lack explicit alignment with the source distribution, leading to error accumulation.
A3‑TTA tackles these issues by introducing Anchor‑Target Images (ATIs). For each incoming test image, the model computes a Class Compact Density (CCD) score, a global class‑level compactness metric derived from the outer product of softmax predictions. Low CCD values indicate that the prediction distribution is highly concentrated along the diagonal of the class‑wise similarity matrix, i.e., the model is confident and the image is likely close to the source domain. Images with the lowest CCD scores are selected as ATIs and their latent features are stored in a fixed‑capacity dynamic feature bank.
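The paper's exact CCD formula is not reproduced in this summary, but the description above (outer product of softmax predictions, with confident predictions concentrating the class-wise similarity matrix on its diagonal) suggests a sketch along the following lines. The function name and the off-diagonal-mass normalization are assumptions for illustration:

```python
import numpy as np

def class_compact_density(probs):
    """Hypothetical CCD sketch. `probs` has shape (C, N): softmax scores
    for C classes over N pixels. We build the class-wise similarity
    matrix S = P P^T (rows L2-normalized) and measure the mass lying
    OFF the diagonal. A confident, class-separated prediction keeps S
    close to the identity, giving a LOW score; a diffuse prediction
    makes the rows of P similar, pushing the score toward 1."""
    norms = np.linalg.norm(probs, axis=1, keepdims=True) + 1e-8
    p = probs / norms                 # normalize each class's score vector
    s = p @ p.T                       # (C, C) class-wise similarity matrix
    off_diag = s.sum() - np.trace(s)  # total off-diagonal similarity mass
    c = probs.shape[0]
    return off_diag / (c * (c - 1) + 1e-8)
```

Under this sketch, images whose predictions yield the lowest scores would be kept as ATIs, matching the selection rule described above.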
When a new test image arrives, its feature vector is compared against all stored ATI features using cosine similarity; the most similar anchor is retrieved. The anchor feature and the current feature are fused via a weighted average (λ≈0.5) followed by L2 normalization, effectively aligning the test representation toward an intermediate “source‑like” distribution. This alignment substantially improves the quality of subsequently generated pseudo‑labels.
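The retrieval-and-fusion step above can be sketched in a few lines of numpy. The function name is hypothetical; cosine retrieval, the weighted average with λ≈0.5, and the final L2 normalization follow the description, while the feature-bank layout (an L×D array) is an assumption:

```python
import numpy as np

def align_to_anchor(feat, bank, lam=0.5):
    """Sketch of anchor alignment: retrieve the most cosine-similar
    anchor feature from the bank, fuse it with the current feature by a
    weighted average, then L2-normalize the result.
    `feat` is a (D,) test-image feature; `bank` is an (L, D) array of
    stored ATI features."""
    f = feat / (np.linalg.norm(feat) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    anchor = bank[np.argmax(b @ f)]            # nearest anchor by cosine similarity
    fused = lam * feat + (1.0 - lam) * anchor  # weighted fusion toward the anchor
    return fused / (np.linalg.norm(fused) + 1e-8)
```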
Pseudo‑labels are refined through three complementary losses: (1) a semantic consistency loss (KL divergence between teacher and student predictions) that preserves class structure; (2) a boundary‑aware entropy minimization loss that multiplies pixel‑wise entropy by a boundary‑sensitivity weight derived from Sobel edges, thereby reducing uncertainty near object borders; and (3) a mean‑teacher loss that encourages the student to follow the teacher’s predictions. Crucially, the teacher model is updated with a self‑adaptive exponential moving average (EMA) rate α_t that is dynamically modulated by the normalized cross‑entropy divergence between teacher and student outputs. When domain shift is large, α_t becomes smaller, allowing the teacher to adapt quickly; when predictions are stable, α_t grows, preserving the teacher’s knowledge and preventing abrupt parameter changes.
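The self-adaptive EMA behavior described above (large teacher-student divergence → smaller α_t → faster teacher adaptation; stable predictions → α_t near its base value) can be illustrated with a minimal sketch. The linear modulation below is an assumption; the paper's actual mapping from the normalized divergence to α_t may differ:

```python
import numpy as np

def self_adaptive_ema(teacher_w, student_w, div, alpha0=0.99):
    """Hypothetical self-adaptive EMA update. `div` is a teacher-student
    prediction divergence normalized to [0, 1]. Large divergence
    (strong domain shift) shrinks the EMA rate so the teacher tracks
    the student quickly; near-zero divergence keeps alpha at its base
    value alpha0, preserving the teacher's knowledge.
    Returns the updated teacher weights and the rate actually used."""
    alpha = alpha0 * (1.0 - np.clip(div, 0.0, 1.0))
    new_teacher = alpha * teacher_w + (1.0 - alpha) * student_w
    return new_teacher, alpha
```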
The authors evaluate A3‑TTA on two medical segmentation tasks (cardiac structure and prostate) and on the adverse‑conditions version of Cityscapes using DeepLabV3+. Across all experiments, A3‑TTA yields large Dice improvements: +10.40 %p for cardiac, +17.68 %p for prostate, and +16.90 %p for Cityscapes, consistently outperforming state‑of‑the‑art TTA baselines such as TENT, PTBN, UPL‑TTA, CoTTA, and others. In a continual TTA scenario where the model is sequentially adapted to four distinct target domains, A3‑TTA maintains performance with less than 1 % degradation, whereas competing methods suffer 5–10 % drops, demonstrating strong anti‑forgetting capabilities.
Ablation studies confirm the importance of each component: CCD‑based ATI selection provides reliable anchors; the feature bank size (L) balances diversity and memory (L=256 offers the best trade‑off); the λ fusion weight and the base EMA rate (α_0=0.99) are robust across datasets. Qualitative analysis shows that ATIs visually resemble source images and that the alignment step reduces feature distribution gaps.
In summary, A3‑TTA combines (i) a lightweight, single‑forward‑pass confidence metric (CCD) to identify high‑quality anchors, (ii) a dynamic feature‑bank‑driven similarity alignment to bridge source‑target gaps, (iii) dual supervision (semantic consistency + boundary‑aware entropy) to generate clean pseudo‑labels, and (iv) a self‑adaptive EMA teacher‑student scheme to ensure stable online learning. This integrated design mitigates label noise and distribution mismatch simultaneously, making A3‑TTA well‑suited for real‑time, privacy‑sensitive applications such as clinical image analysis and autonomous driving where rapid, reliable adaptation is essential. Future work may explore multimodal anchor selection and temporal consistency for video streams.