Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation


Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.


💡 Research Summary

The paper tackles the problem of test‑time adaptation (TTA) for multimodal models, where distribution shifts may affect only a subset of modalities. Existing multimodal TTA approaches either adapt all modalities indiscriminately or rely on simple confidence/entropy cues, leading to two major failure modes: (1) negative transfer in modalities that are already well‑aligned (unbiased modalities) and (2) catastrophic forgetting in the modality that is most affected by the shift (biased modality). This reflects the classic stability‑plasticity dilemma: a model must retain source‑domain knowledge (stability) while quickly learning new domain‑specific patterns (plasticity).

Key Insight – Redundancy Score
The authors observe that, after the fusion layer, the latent representations of a biased modality become highly correlated across dimensions, i.e., they exhibit strong inter-dimensional redundancy. To quantify this, they define a redundancy score R(Z) as the average squared off-diagonal entry of the normalized covariance matrix of the batch feature matrix Z. Formally, $R(Z) = \frac{1}{D(D-1)} \sum_{i \neq j} C_{ij}^2$, where D is the number of feature dimensions and C is the covariance matrix normalized by the per-dimension standard deviations (i.e., the correlation matrix). In a well-disentangled latent space R≈0; under domain shift, especially for the biased modality, R rises sharply. The method computes R for each modality on a mini-batch, compares the values, and flags as "biased" any modality m whose gap Δ_m = R_m − min_n R_n exceeds a preset threshold δ. This rule-based, non-parametric diagnosis works without source-domain statistics and is robust across several corruption types.
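The diagnosis step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `redundancy_score` and `diagnose_biased`, the synthetic "video" and "audio" features, and the threshold value `delta=0.1` are all hypothetical choices made for the example.

```python
import numpy as np

def redundancy_score(Z, eps=1e-8):
    """Mean squared off-diagonal entry of the correlation matrix of Z (shape N x D)."""
    Z = np.asarray(Z, dtype=np.float64)
    n, d = Z.shape
    Zc = Z - Z.mean(axis=0, keepdims=True)        # center each dimension
    Zn = Zc / (Zc.std(axis=0, keepdims=True) + eps)  # scale to unit variance
    C = (Zn.T @ Zn) / n                           # D x D normalized covariance
    off_diag_sq = (C ** 2).sum() - (np.diag(C) ** 2).sum()
    return off_diag_sq / (d * (d - 1))

def diagnose_biased(features, delta):
    """Flag every modality whose score exceeds the smallest score by more than delta."""
    scores = {m: redundancy_score(Z) for m, Z in features.items()}
    r_min = min(scores.values())
    biased = {m for m, r in scores.items() if r - r_min > delta}
    return biased, scores

# Synthetic example (assumption): "video" has independent dimensions,
# while every "audio" dimension is a noisy copy of one shared signal.
rng = np.random.default_rng(0)
video = rng.standard_normal((256, 16))
shared = rng.standard_normal((256, 1))
audio = shared + 0.1 * rng.standard_normal((256, 16))

biased, scores = diagnose_biased({"video": video, "audio": audio}, delta=0.1)
```

In this toy setting the audio features are almost perfectly correlated across dimensions, so their score sits close to 1 while the video score stays near 0, and only the audio modality is flagged. A suitable δ for real features would have to be chosen empirically.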

Asymmetric Adaptation Architecture
Once the biased modality set G is identified, DASP modifies the standard modality‑specific adapters Φ_m. Each adapter is split into a low‑rank “stable” component ϕ_m^s and a high‑rank “plastic” component ϕ_m^p. The low‑rank design limits capacity, encouraging the component to capture domain‑agnostic, generalizable features; the high‑rank design provides enough expressive power to model domain‑specific variations.

  • For biased modalities (m∈G): the pipeline is z̃_m = ϕ_m^p(ϕ_m^s(z_m)). During adaptation only ϕ_m^p is updated; ϕ_m^s is frozen, preventing the source knowledge embedded in the stable path from being overwritten.
  • For unbiased modalities (m∉G): the plastic branch is disabled, and the output is z̃_m = ϕ_m^s(z_m). Here only ϕ_m^s is updated, guided by a KL-regularization term L_kl = D_KL(p_m^tgt ‖ p_m^src). The KL term penalizes divergence between the current target-domain predictive distribution and the original source-domain distribution, thereby preserving stability.
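The routing rule in the two bullets above can be illustrated with a small NumPy sketch. Several details here are assumptions rather than facts from the paper: the class name `DecoupledAdapter`, the residual down/up-projection form of each path, the zero initialization of the up-projections, and the rank values 4 and 32 are all hypothetical. Only the routing (stable then plastic for biased modalities, stable only for unbiased ones) and the choice of which path is trainable follow the description.

```python
import numpy as np

class DecoupledAdapter:
    """Hypothetical adapter with a low-rank stable path and a higher-rank plastic path."""

    def __init__(self, dim, stable_rank=4, plastic_rank=32, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: residual bottleneck paths. Up-projections start at zero,
        # so each path initially acts as the identity map.
        self.stable = {
            "down": rng.standard_normal((dim, stable_rank)) * 0.02,
            "up": np.zeros((stable_rank, dim)),
        }
        self.plastic = {
            "down": rng.standard_normal((dim, plastic_rank)) * 0.02,
            "up": np.zeros((plastic_rank, dim)),
        }

    @staticmethod
    def _path(z, w):
        # Residual bottleneck: z + (z @ down) @ up
        return z + (z @ w["down"]) @ w["up"]

    def forward(self, z, is_biased):
        h = self._path(z, self.stable)       # stable path is always applied
        if is_biased:
            h = self._path(h, self.plastic)  # plastic path only for biased modalities
        return h

    def trainable(self, is_biased):
        """Parameter groups to update at test time for this modality."""
        return ["plastic"] if is_biased else ["stable"]

adapter = DecoupledAdapter(dim=8)
z = np.ones((2, 8))
out_biased = adapter.forward(z, is_biased=True)     # stable then plastic
out_unbiased = adapter.forward(z, is_biased=False)  # stable only
```

Because the plastic path is skipped entirely for unbiased modalities, any change to its weights leaves their outputs untouched, which is the property the asymmetric design relies on.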

Both branches are trained with the usual unsupervised TTA objective (e.g., entropy minimization), but the KL term, which applies only to the stable adapters of unbiased modalities, ensures that those updates do not degrade source performance.
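As a rough sketch of how such an objective might be assembled, the snippet below combines entropy minimization with the KL term D_KL(p^tgt ‖ p^src) for unbiased modalities only. The function names, the `kl_weight` parameter, and the idea of obtaining `p_src` from a frozen copy of the source model are assumptions made for illustration; the summary does not specify how the two terms are weighted.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    """Mean Shannon entropy of a batch of predictive distributions."""
    return float(-(p * np.log(p + eps)).sum(axis=-1).mean())

def kl_divergence(p_tgt, p_src, eps=1e-12):
    """Mean D_KL(p_tgt || p_src) over the batch."""
    return float((p_tgt * (np.log(p_tgt + eps) - np.log(p_src + eps))).sum(axis=-1).mean())

def tta_loss(p_tgt, p_src, is_biased, kl_weight=1.0):
    """Entropy minimization, plus a KL penalty for unbiased modalities only."""
    loss = entropy(p_tgt)
    if not is_biased:
        # kl_weight is a hypothetical hyperparameter, not taken from the paper.
        loss += kl_weight * kl_divergence(p_tgt, p_src)
    return loss

# Toy predictions (assumption): the source model is uniform, the adapted model is sharper.
p_src = softmax(np.zeros((2, 4)))
p_tgt = softmax(np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]))

loss_biased = tta_loss(p_tgt, p_src, is_biased=True)
loss_unbiased = tta_loss(p_tgt, p_src, is_biased=False)
```

For the unbiased case the loss is strictly larger whenever the adapted predictions differ from the source predictions, so gradient updates to the stable adapter are pulled back toward the source model's behavior.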

Experimental Validation
The authors evaluate DASP on two large‑scale multimodal benchmarks: Kinetics‑50‑C (video‑audio) and VGGSound‑C (audio‑video). They introduce systematic corruptions (e.g., noise, blur, temporal jitter) to one modality at a time, creating a realistic continuous‑shift scenario. Compared against state‑of‑the‑art methods such as READ, TSA, MD‑AA, and generic entropy‑based TTA, DASP achieves 3–5 percentage‑point higher overall accuracy. More importantly, it dramatically reduces performance loss on the unbiased modality (negative transfer) while quickly recovering accuracy on the corrupted modality (avoiding catastrophic forgetting).

Ablation studies confirm the necessity of each component: (i) using the redundancy score for diagnosis is crucial—randomly selecting a biased modality leads to severe negative transfer; (ii) employing both stable and plastic adapters is essential—using only plastic adapters causes forgetting, while using only stable adapters fails to adapt the biased modality.

Broader Impact and Applicability
DASP offers a principled solution to the stability‑plasticity dilemma in multimodal TTA. The redundancy score provides a lightweight, label‑free diagnostic that can be integrated into any multimodal pipeline. The asymmetric adapter design, with its low‑rank/high‑rank split, is computationally efficient and suitable for edge devices where memory and latency constraints are strict. Potential applications include autonomous driving (camera‑LiDAR‑radar fusion), robotics (vision‑touch‑audio), and multimodal surveillance, where environments evolve continuously and models must adapt without sacrificing previously learned safety‑critical knowledge.

In summary, the paper introduces a two‑stage “diagnose‑then‑mitigate” framework that (1) automatically detects which modality is suffering from a distribution shift via inter‑dimensional redundancy, and (2) applies a modality‑specific, asymmetric adaptation strategy that decouples stability and plasticity. This approach yields superior adaptability and robustness, setting a new benchmark for multimodal test‑time adaptation.

