Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation


Assuming that neither the source data nor the source model is accessible, black-box domain adaptation is a highly practical yet extremely challenging setting: transferable information is restricted to the predictions of the black-box source model, which can only be queried with target samples. Existing approaches attempt to extract transferable knowledge through pseudo-label refinement or by leveraging external vision-language models (ViLs), but they often suffer from noisy supervision or insufficient use of the semantic priors a ViL provides, which ultimately hinders adaptation performance. To overcome these limitations, we propose a dual-teacher distillation with subnetwork rectification (DDSR) model that jointly exploits the specific knowledge embedded in black-box source models and the general semantic information of a ViL. DDSR adaptively integrates their complementary predictions to generate reliable pseudo-labels for the target domain and introduces a subnetwork-driven regularization strategy to mitigate overfitting caused by noisy supervision. Furthermore, the refined target predictions iteratively enhance both the pseudo-labels and the ViL prompts, enabling more accurate and semantically consistent adaptation. Finally, the target model is further optimized through self-training with class-wise prototypes. Extensive experiments on multiple benchmark datasets validate the effectiveness of our approach, demonstrating consistent improvements over state-of-the-art methods, including those that use source data or models.


💡 Research Summary

The paper tackles the most restrictive scenario in domain adaptation, Black‑Box Domain Adaptation (BBDA), where neither source data nor the source model parameters are available; only the predictions of a black‑box source model can be queried. Existing BBDA methods either rely solely on these noisy predictions or incorporate a vision‑language model (ViL) such as CLIP without fully exploiting the complementary strengths of the two teachers. To address these gaps, the authors propose Dual‑Teacher Distillation with Subnetwork Rectification (DDSR), a two‑stage framework that jointly leverages the task‑specific knowledge embedded in the black‑box source model and the broad semantic priors of CLIP.

Stage 1 – Dual‑Teacher Knowledge Distillation and Subnetwork Regularization
Both teachers generate soft predictions for each target sample: (\hat y_b) from the black‑box source model and (\hat y_c) from CLIP. The method computes the entropy of each prediction, (H_b) and (H_c), and defines an adaptive weight (\alpha = H_c/(H_b+H_c)). When the target dataset is large ((n_t > \tilde n_t)), CLIP’s predictions receive higher weight; otherwise the source model is emphasized. The fused prediction (\hat y) becomes the pseudo‑label for knowledge distillation. The target network (f_t) is trained by minimizing the KL‑divergence between its output (\hat y_t) and (\hat y). To improve robustness, a Mixup‑based consistency loss and an information‑maximization loss are added.
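The fusion step above can be sketched as follows. The summary gives the weight (\alpha = H_c/(H_b+H_c)) but does not state which teacher it multiplies; the sketch assumes (\alpha) weights the black-box prediction, so the lower-entropy (more confident) teacher dominates. Function names and the toy probability vectors are illustrative, not from the paper.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    return float(-np.sum(p * np.log(p + eps)))

def fuse_predictions(y_b, y_c):
    """Entropy-weighted fusion of the two teacher predictions.

    Assumption: alpha = H_c / (H_b + H_c) weights the black-box teacher,
    so the more confident (lower-entropy) teacher receives the larger weight.
    """
    h_b, h_c = entropy(y_b), entropy(y_c)
    alpha = h_c / (h_b + h_c)            # weight on the black-box prediction
    y = alpha * y_b + (1.0 - alpha) * y_c
    return y / y.sum()                   # renormalize the fused pseudo-label

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), the distillation objective for the target network."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Confident black-box teacher vs. a more uncertain CLIP prediction
y_b = np.array([0.85, 0.10, 0.05])
y_c = np.array([0.40, 0.35, 0.25])
y = fuse_predictions(y_b, y_c)
```

With these toy inputs the black-box teacher has lower entropy, so the fused pseudo-label stays close to it while still being softened by CLIP's semantic prior.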

A crucial innovation is the subnetwork‑driven regularization. A subnetwork shares part of the architecture and parameters with the target model but is trained separately. Two regularizers are imposed: (1) output alignment, encouraging the subnetwork’s predictions to match those of the target model, and (2) gradient discrepancy, penalizing differences in gradients with respect to the same input. This dual regularization mitigates over‑fitting to noisy pseudo‑labels, a common failure mode in BBDA.
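The two regularizers can be illustrated on a tiny linear-softmax model, where the gradient of the cross-entropy loss with respect to the input has a closed form. The MSE output-alignment term, the Euclidean gradient penalty, and the weight matrices are all assumptions made for this sketch; the paper's actual subnetwork shares layers of a deep network rather than rows of a matrix.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_wrt_input(W, x, y):
    """Gradient of the cross-entropy loss -log p_y w.r.t. the input x
    for a linear-softmax model p = softmax(W @ x): dL/dx = W.T @ (p - onehot)."""
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return W.T @ (p - onehot)

def subnetwork_regularizers(W_t, W_s, x, y):
    """(1) output alignment (MSE over probabilities, an assumption) and
    (2) gradient discrepancy between target model W_t and subnetwork W_s."""
    p_t, p_s = softmax(W_t @ x), softmax(W_s @ x)
    r_out = float(np.mean((p_t - p_s) ** 2))
    g_t = grad_wrt_input(W_t, x, y)
    g_s = grad_wrt_input(W_s, x, y)
    r_grad = float(np.sum((g_t - g_s) ** 2))
    return r_out, r_grad

rng = np.random.default_rng(0)
W_t = rng.normal(size=(3, 5))
W_s = W_t.copy()
W_s[-1] += 0.1 * rng.normal(size=5)   # subnetwork shares most parameters
x = rng.normal(size=5)
r_out, r_grad = subnetwork_regularizers(W_t, W_s, x, y=0)
```

Both penalties vanish when the subnetwork agrees with the target model exactly, and grow as the shared representation drifts, which is what discourages over-fitting to noisy pseudo-labels.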

During training, the target model’s predictions are progressively refined and used to update the pseudo‑labels via an exponential moving average (EMA). Simultaneously, CLIP prompts are adapted to the target domain using a contrastive loss (\mathcal L_{cm}), ensuring that the semantic guidance from CLIP remains domain‑relevant.
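A minimal sketch of the EMA pseudo-label refinement, assuming a fixed momentum value (0.9 here is illustrative; the paper's schedule may differ):

```python
import numpy as np

def ema_update(pseudo_label, model_pred, momentum=0.9):
    """Refine a pseudo-label with the current target-model prediction via
    an exponential moving average, then renormalize."""
    updated = momentum * pseudo_label + (1.0 - momentum) * model_pred
    return updated / updated.sum()

pseudo = np.array([0.6, 0.3, 0.1])       # initial fused pseudo-label
pred = np.array([0.2, 0.7, 0.1])         # evolving target-model prediction
for _ in range(20):                      # repeated updates drift toward the model
    pseudo = ema_update(pseudo, pred)
```

After enough updates the pseudo-label is dominated by the (presumably improving) target model, which is the intended self-correcting behavior.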

Stage 2 – Prototype‑Based Self‑Training
After the first stage, the target model extracts feature vectors for all target samples. Class‑wise prototypes are computed as the mean feature of each predicted class. Each sample is then reassigned to the nearest prototype, producing corrected pseudo‑labels. These refined labels are employed in a second round of self‑training, further tightening the decision boundaries and enhancing feature discriminability.
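The prototype step above can be sketched in a few lines; the Euclidean distance metric and the toy 2-D features are assumptions (the paper may use cosine similarity on deep features):

```python
import numpy as np

def prototype_refine(features, pseudo_labels, num_classes):
    """Compute class-wise prototypes as the mean feature of each predicted
    class, then reassign every sample to its nearest prototype (Euclidean)."""
    protos = np.stack([
        features[pseudo_labels == c].mean(axis=0) for c in range(num_classes)
    ])
    dists = np.linalg.norm(features[:, None, :] - protos[None, :, :], axis=2)
    return dists.argmin(axis=1), protos

# Two well-separated clusters; sample 2 starts with a wrong pseudo-label
features = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                     [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
noisy = np.array([0, 0, 1, 1, 1, 1])     # sample 2 mislabeled as class 1
refined, protos = prototype_refine(features, noisy, num_classes=2)
```

Because prototypes average over many samples, the single mislabeled point barely shifts the class-1 prototype, and nearest-prototype reassignment pulls it back to class 0; the corrected labels then drive the second round of self-training.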

Experiments
The authors evaluate DDSR on three widely used benchmarks: Office‑31, Office‑Home, and DomainNet, under the strict BBDA protocol (no source data, no source model access). DDSR consistently outperforms prior BBDA methods such as DINE, AEM, and BBC, as well as recent source‑free and unsupervised adaptation approaches that have access to the source model or data. Ablation studies confirm the importance of (i) adaptive fusion versus fixed weighting, (ii) the subnetwork regularizer, and (iii) the prototype‑based refinement.

Contributions

  1. An entropy‑driven adaptive fusion that balances task‑specific and generic semantic predictions, yielding reliable pseudo‑labels.
  2. A subnetwork‑based regularization scheme that curbs over‑fitting to noisy supervision through output and gradient consistency.
  3. A dynamic loop that refines pseudo‑labels and CLIP prompts using the evolving target model, followed by prototype‑driven self‑training for final performance gains.

Overall, DDSR presents a comprehensive solution to BBDA, demonstrating that careful integration of heterogeneous teachers, regularization via a shared subnetwork, and iterative self‑training can overcome the severe information constraints of the black‑box setting and set a new performance baseline for privacy‑preserving domain adaptation.

