NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Although MAE is simple and well recognized for its robustness, it is rarely used in noisy-label learning because, outside of prompt learning scenarios, it converges slowly and performs poorly. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method that further enhances robustness. PromptOT employs text features in vision-language models as prototypes to construct an optimal transport matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing cross-entropy loss to be applied to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
💡 Research Summary
The paper addresses the problem of noisy labels in prompt learning for vision‑language foundation models such as CLIP. While prompt tuning is lightweight and effective, it typically relies on cross‑entropy (CE) loss, which is highly sensitive to label noise and can overfit noisy data. The authors propose NLPrompt, a simple yet powerful framework that combines two key components: PromptMAE and PromptOT.
PromptMAE replaces the conventional CE loss with mean absolute error (MAE) loss during prompt tuning. Although MAE is known for robustness to noisy labels, it has been largely avoided in classification because of slow convergence and lower accuracy in standard settings. The authors show that, in the context of vision‑language models where text and image embeddings are already well aligned, MAE can converge quickly and retain high accuracy. Using feature‑learning theory, they decompose the learnable prompt into task‑relevant (µ) and task‑irrelevant (ξₗ) components. Their analysis demonstrates that MAE suppresses the coefficients of the irrelevant components, effectively reducing the influence of noisy samples while preserving or even amplifying the alignment with the relevant feature µ. Consequently, the signal‑to‑noise ratio (SNR) of the learned prompt improves, leading to robustness against substantial label noise (e.g., 50% symmetric noise). Empirical results on Caltech‑101, CIFAR‑10/100, and other benchmarks confirm that PromptMAE maintains high accuracy and converges comparably to CE loss even under heavy noise.
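The bounded-loss intuition behind PromptMAE can be made concrete with a small NumPy sketch (not the paper's implementation; function names are illustrative). For a one-hot label y, the MAE between the softmax probability vector p and y equals 2(1 − p_y), so the loss a confidently mislabeled sample can contribute is capped at 2, whereas CE, −log p_y, grows without bound and lets noisy samples dominate the gradient:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mae_loss(logits, labels):
    # ||p - y||_1 for one-hot y reduces to 2 * (1 - p_y): bounded by 2,
    # so a single mislabeled sample cannot contribute unbounded loss.
    p = softmax(logits)
    n = len(labels)
    return float(np.mean(2.0 * (1.0 - p[np.arange(n), labels])))

def ce_loss(logits, labels):
    # Cross-entropy -log(p_y): unbounded as p_y -> 0, so confidently
    # mislabeled samples dominate the training signal.
    p = softmax(logits)
    n = len(labels)
    return float(np.mean(-np.log(p[np.arange(n), labels] + 1e-12)))

# A sample the model confidently (and correctly) predicts as class 0,
# but whose label was flipped to class 1:
logits = np.array([[8.0, 0.0]])
print(mae_loss(logits, np.array([1])))  # close to the cap of 2
print(ce_loss(logits, np.array([1])))   # ~8, and grows with confidence
```

This boundedness is the classical reason MAE is noise-robust; the paper's feature-learning analysis explains why, in the aligned CLIP feature space, this robustness comes without the slow convergence MAE usually suffers.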
PromptOT is a prompt‑based optimal transport (OT) data‑purification method. Traditional OT‑based sample selection uses randomly initialized prototypes, which are suboptimal for prompt learning. NLPrompt leverages the text encoder of the pre‑trained vision‑language model to generate class‑specific text embeddings that serve as prototypes. By constructing a cost matrix from cosine similarities between image features and these text prototypes, and solving an entropy‑regularized OT problem with the Sinkhorn algorithm, the method obtains an optimal transport matrix that partitions the dataset into a clean subset (S⁺) and a noisy subset (S⁻). The clean subset is trained with CE loss (which performs best on clean data), while the noisy subset is trained with PromptMAE (which is robust to noise). This dual‑loss strategy harmonizes the strengths of both losses and benefits from the global distribution alignment enforced by OT.
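The purification pipeline above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's code: `sinkhorn`, `purify`, the regularization strength `eps`, and the uniform class marginals are all assumptions made for the sketch. Image features and text prototypes are assumed L2-normalized (as CLIP features typically are), the cost is 1 minus cosine similarity, and a sample is kept in the clean subset S⁺ when its OT-based pseudo-label agrees with its given label:

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=100):
    # Entropy-regularized OT between a uniform distribution over samples
    # (rows) and a uniform distribution over classes (columns).
    n, k = cost.shape
    K = np.exp(-cost / eps)          # Gibbs kernel
    r = np.ones(n) / n               # row marginal (samples)
    c = np.ones(k) / k               # column marginal (classes)
    u = np.ones(n) / n
    for _ in range(n_iters):
        v = c / (K.T @ u)            # alternating Sinkhorn scalings
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan Q

def purify(image_feats, text_protos, labels):
    # Cosine-similarity cost between image features and text prototypes
    # (both assumed L2-normalized); a sample is "clean" when the OT
    # assignment agrees with its (possibly noisy) given label.
    sim = image_feats @ text_protos.T
    Q = sinkhorn(1.0 - sim)
    pseudo = Q.argmax(axis=1)
    clean = pseudo == labels
    return clean, ~clean
```

In the full NLPrompt loop, the `clean` split would then be trained with CE loss and the `noisy` split with the MAE loss of PromptMAE; because the OT marginals constrain how much mass each class receives, the assignment respects the global class distribution rather than trusting each sample's similarity in isolation.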
Comprehensive experiments demonstrate that NLPrompt consistently outperforms existing prompt‑learning baselines (CoOp, CoCoOp, JoAPR, etc.) across various noise rates (20%, 40%, 60%). Ablation studies reveal that PromptMAE alone already provides robustness, but the combination with PromptOT yields the highest gains. Moreover, replacing text‑based prototypes with random vectors degrades performance sharply, confirming that the intrinsic alignment of vision‑language models is crucial for effective purification.
In summary, NLPrompt offers a clean, efficient solution for robust prompt learning under noisy labels. By exploiting MAE’s inherent noise tolerance within the aligned feature space of vision‑language models and by using text‑driven optimal transport for data cleaning, the method achieves superior accuracy without resorting to complex meta‑learning, label‑correction, or extensive data augmentation techniques. This work opens avenues for applying prompt‑based fine‑tuning to real‑world, imperfect datasets across multimodal AI applications.