Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation
In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA, and GPT-4V, leveraging these models to replace time-consuming manual annotation and enable annotation-free training has become a promising research direction. This paper studies learning from noisy partial labels generated by pre-trained VLMs and proposes a collaborative consistency regularization (Co-Reg) framework. Unlike symmetric noise commonly assumed in traditional noisy label learning, VLM-generated noise is instance-dependent and reflects the intrinsic biases of pre-trained models, posing greater challenges. To address this issue, we jointly train two neural networks to perform collaborative label purification via a co-pseudo-labeling mechanism, while enforcing consistency regularization in both label and feature representation spaces. In addition, multiple anti-overfitting strategies are introduced, including alternating optimization of contrastive representations and pseudo-labels, as well as maintaining class prototypes in a shared feature space. The proposed method can further incorporate few-shot manually annotated labels for performance enhancement. Extensive experiments under various settings demonstrate the effectiveness of our approach and highlight the potential of integrating weakly supervised learning into the knowledge distillation of pre-trained models.
💡 Research Summary
The paper tackles the problem of learning from noisy partial labels (NPLL) that are automatically generated by large pre‑trained vision‑language models (VLMs) such as CLIP, LLaVA, and GPT‑4V. Unlike traditional noisy‑label research that assumes symmetric, label‑independent noise, the authors point out that VLM‑generated noise is instance‑dependent and reflects the biases of the teacher model, making it substantially harder to correct. To address this, they propose a Collaborative Consistency Regularization (Co‑Reg) framework that jointly trains two neural networks which mutually exchange reliable samples and jointly refine noisy ones through a co‑pseudo‑labeling mechanism.
Key components of Co‑Reg are:
- Dual‑network co‑training – Each network partitions the training set into a “reliable” subset and a “noisy” subset based on its own confidence scores. Reliable samples are passed to the peer network for supervised training, while noisy samples are treated as unlabeled. For the latter, multiple augmented views are fed to both networks, their predictions are aggregated, and a refined soft label distribution is produced. This cross‑checking mitigates confirmation bias that would otherwise amplify VLM‑specific errors.
- Consistency regularization in label and feature spaces – In the label space, the two networks are forced to produce similar class‑probability vectors (KL/JS divergence loss). In the feature space, both networks share a common projected embedding space and maintain class prototypes (exponential moving averages of class‑wise embeddings). A prototype‑alignment loss aligns each sample’s embedding similarity distribution with its refined label distribution, preventing noisy labels from distorting the representation geometry.
- Anti‑overfitting mechanisms – The framework incorporates a contrastive learning module that pulls together embeddings of different augmentations of the same image, thereby learning discriminative features even when label confidence is low. Training alternates between label‑refinement steps and representation‑learning steps, resembling an EM‑like iterative refinement. The shared prototypes are updated slowly to provide stability across iterations.
- Few‑shot extension – A small set of manually annotated samples can be injected to initialize prototypes and to serve as trusted seeds during co‑training, further boosting performance without sacrificing the annotation‑free nature of the main pipeline.
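The core mechanics described above can be illustrated in a minimal NumPy sketch. The function names, the temperature and momentum values, the hard prototype assignment, and the JS‑divergence form of the label‑consistency penalty are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def refine_labels(probs_a, probs_b, candidate_mask, temperature=0.5):
    """Co-pseudo-labeling sketch: average the two networks' predictions over
    augmented views, zero out classes outside the candidate set, and sharpen
    the result into a refined soft label (temperature is an assumed value)."""
    # probs_a, probs_b: (n_views, n_classes) softmax outputs per network
    avg = (probs_a.mean(axis=0) + probs_b.mean(axis=0)) / 2.0
    masked = avg * candidate_mask                  # restrict to candidate labels
    masked = masked / masked.sum()                 # renormalize over candidates
    sharpened = masked ** (1.0 / temperature)      # T < 1 sharpens the distribution
    return sharpened / sharpened.sum()

def label_consistency(p, q, eps=1e-12):
    """Label-space consistency: Jensen-Shannon divergence between the two
    networks' class-probability vectors (one of the KL/JS options mentioned)."""
    m = (p + q) / 2.0
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def update_prototypes(prototypes, embeddings, soft_labels, momentum=0.99):
    """EMA update of class prototypes in the shared embedding space; the hard
    argmax assignment and momentum value are simplifying assumptions."""
    for z, q in zip(embeddings, soft_labels):
        c = int(np.argmax(q))
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * z
        prototypes[c] /= np.linalg.norm(prototypes[c])  # keep on the unit sphere
    return prototypes
```

In this sketch the refined label both supervises the peer network and drives the prototype‑alignment step, which is what couples the label‑space and feature‑space consistency terms.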
The authors evaluate Co‑Reg on several benchmarks: CIFAR‑10/100, ImageNet‑R, and domain‑specific datasets such as medical chest X‑rays and satellite imagery. For each dataset, they generate candidate label sets by applying multiple prompt templates to three different VLMs, thereby simulating realistic VLM‑generated NPLL. Compared with state‑of‑the‑art NPLL methods (e.g., LNL‑Flywheel, R‑CAL) and with standard PLL approaches, Co‑Reg consistently outperforms them by 5–9 percentage points in top‑1 accuracy, especially when the instance‑dependent noise rate exceeds 40%.
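The VLM annotation step can be approximated as zero‑shot retrieval over class prompts. The sketch below builds a candidate set per image from image–text similarity scores; the `top_k` cutoff and the single‑template simplification are assumptions (the paper takes candidates from multiple prompt templates and annotators):

```python
import numpy as np

def candidate_sets(image_embs, text_embs, top_k=3):
    """Build a partial (candidate) label set per image from zero-shot
    similarity scores, mimicking a VLM annotator queried with class prompts.
    image_embs: (n, d) image embeddings; text_embs: (c, d) class-prompt
    embeddings; both assumed L2-normalized so the dot product is cosine
    similarity. In the full pipeline the union over prompt templates and
    VLMs would be taken; a single template is used here for brevity."""
    sims = image_embs @ text_embs.T          # (n, c) cosine similarities
    order = np.argsort(-sims, axis=1)        # classes sorted by similarity
    return [set(row[:top_k].tolist()) for row in order]
```

Candidate sets produced this way are exactly the noisy partial labels Co‑Reg consumes: the true class is usually, but not always, inside the top‑k, and the spurious candidates depend on the instance, which is the instance‑dependent noise the paper emphasizes.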
When contrasted with knowledge‑distillation pipelines (teacher VLM → student network) and few‑shot fine‑tuning techniques (LoRA, adapters), Co‑Reg achieves comparable or better accuracy while using ≈10× fewer parameters and incurring lower inference cost. Ablation studies reveal that removing label‑consistency, feature‑consistency, or contrastive learning each degrades performance by 1.5–2.3 pp, confirming the complementary role of each component. Adding just five manually labeled examples yields an additional ~1.8 pp gain, demonstrating the practicality of the few‑shot extension.
The paper’s contributions can be summarized as follows:
- Formalization of NPLL where noisy partial labels are sourced from VLMs, highlighting the challenges of instance‑dependent noise.
- Introduction of a collaborative consistency regularization framework that jointly refines labels and representations via dual‑network co‑training, prototype alignment, and contrastive learning.
- Extensive empirical validation across diverse domains and VLM annotators, showing significant improvements over existing weakly‑supervised and distillation baselines.
- Demonstration that a small amount of human‑annotated data can be seamlessly integrated to further enhance performance, bridging the gap between fully annotation‑free and few‑shot regimes.
Overall, the work presents a compelling pipeline—VLM → automatic noisy partial labels → Co‑Reg → downstream model—that reduces labeling costs dramatically while delivering robust, parameter‑efficient models. It opens avenues for applying large multimodal foundation models to real‑world tasks where manual annotation is prohibitive, and suggests future research directions such as extending to video or 3D data, scaling prototype management for extremely large label spaces, and exploring more sophisticated prompt‑engineering strategies to further improve the quality of VLM‑generated candidate sets.