DeCLIP: Decoupled Prompting for CLIP-based Multi-Label Class-Incremental Learning

Notice: This research summary and analysis were automatically generated using AI technology; for authoritative details, please refer to the original arXiv paper.

Multi-label class-incremental learning (MLCIL) continuously expands the label space while recognizing multiple co-occurring classes, making it prone to catastrophic forgetting and high false-positive rates (FPR). Extending CLIP to MLCIL is non-trivial because co-occurring categories violate CLIP’s single image-text alignment paradigm and task-level partial labeling induces high FPR. We propose DeCLIP, a replay-free and parameter-efficient framework that decouples CLIP representations via a one-to-one class-specific prompting scheme. By assigning each category its own prompt space, DeCLIP prevents semantic confusion across labels and decouples multi-label images into per-class views compatible with CLIP pre-training. The learned prompts are preserved as knowledge anchors, mitigating catastrophic forgetting without replay. We further introduce Adaptive Similarity Tempering (AST), a task-aware strategy that suppresses FPR without dataset-specific tuning. Experiments on MS-COCO and PASCAL VOC show that DeCLIP consistently outperforms prior methods with minimal trainable parameters.


💡 Research Summary

DeCLIP introduces a novel, replay‑free framework for multi‑label class‑incremental learning (MLCIL) built on top of the pre‑trained CLIP model. The authors identify two fundamental challenges when extending CLIP to MLCIL: (1) semantic confusion caused by co‑occurring labels that violate CLIP’s single image‑text alignment, and (2) a dramatically high false‑positive rate (FPR) due to task‑level partial labeling, where only the current task’s labels are available during training. To address these issues, DeCLIP proposes (i) a one‑to‑one class‑specific prompting scheme and (ii) an Adaptive Similarity Tempering (AST) mechanism.

In the prompting scheme, each class c receives its own pair of positive and negative prompts for both the visual and textual branches. The positive prompt encodes the presence of the class, while the negative prompt encodes its absence. During training, the frozen CLIP visual encoder receives the visual prompt together with the input image, producing class‑specific visual tokens. Simultaneously, the frozen CLIP text encoder processes the class name combined with the textual prompts, yielding class‑specific text embeddings. Cosine similarity is computed between the visual tokens and the text embeddings, producing a positive similarity s⁺ and a negative similarity s⁻ for each class. These two scores are normalized by a binary softmax (temperature τ=1 during training) to obtain a per‑class presence confidence ŷ⁺. After a class has been learned, its prompts are frozen and stored as lightweight “knowledge anchors.” Because prompts are never overwritten by later tasks, catastrophic forgetting is mitigated without any replay buffer.
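The per-class scoring step above can be sketched in a few lines of NumPy. This is a minimal illustration of the binary softmax over the similarity pair (s⁺, s⁻); the function and variable names are ours, not the paper's:

```python
import numpy as np

def class_confidence(s_pos, s_neg, tau=1.0):
    """Binary softmax over one class's (s_pos, s_neg) similarity pair.

    s_pos, s_neg: cosine similarities between the class-specific visual
    tokens and the positive/negative text embeddings (scalars in [-1, 1]).
    tau: softmax temperature (tau = 1 during training, per the summary).
    Returns the per-class presence confidence ŷ⁺ in (0, 1).
    """
    logits = np.array([s_pos, s_neg]) / tau
    logits -= logits.max()          # subtract max for numerical stability
    exp = np.exp(logits)
    return float(exp[0] / exp.sum())

# A class whose positive prompt matches the image better than the
# negative one gets a confidence above 0.5 ("present"):
conf = class_confidence(0.6, 0.1)
```

Because each class has its own pair of scores, this reduces multi-label recognition to an independent two-way decision per class, which is what lets the scheme coexist with CLIP's single image-text alignment.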

The second component, AST, tackles the over‑confidence of absent classes that arises from the missing negative labels in each incremental step. For each class, DeCLIP applies a task‑aware temperature τ(t) to the softmax that normalizes (s⁺, s⁻) at inference time. τ(t) is defined as τ(t)=max(1, λ·t/|C₁:t|), where |C₁:t| is the total number of classes learned up to task t and λ is a small constant shared across datasets. This schedule guarantees τ(t)≥1 and gradually increases the temperature as more tasks are added, automatically adapting to different incremental configurations. Higher temperature smooths the softmax, pulling down overly confident scores for absent classes and thereby reducing FPR dramatically (from ~25% to ~2% on VOC B4‑C2). Unlike generic regularizers such as maximum‑entropy or asymmetric loss, AST operates directly on the similarity pair, preserving the discriminative power of the learned prompts.
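The AST schedule and its calibrating effect can be sketched as follows. The schedule is reproduced exactly as stated in the summary; the value λ=10 is purely illustrative (chosen so the schedule exceeds 1 in this toy setting), not the paper's constant:

```python
import numpy as np

def ast_temperature(t, num_seen_classes, lam=10.0):
    """Task-aware temperature tau(t) = max(1, lam * t / |C_1:t|), as given
    in the summary. lam = 10 is an illustrative value, not the paper's."""
    return max(1.0, lam * t / num_seen_classes)

def calibrated_confidence(s_pos, s_neg, tau):
    """Binary softmax over (s⁺, s⁻) with the AST temperature applied
    at inference time."""
    logits = np.array([s_pos, s_neg]) / tau
    exp = np.exp(logits - logits.max())
    return float(exp[0] / exp.sum())

# A higher temperature smooths the softmax, pulling an over-confident
# score back toward 0.5 (c2 is closer to 0.5 than c1):
c1 = calibrated_confidence(0.9, 0.1, tau=1.0)
c2 = calibrated_confidence(0.9, 0.1, tau=3.0)
```

Note that the tempering only rescales, and never reorders, the two similarities, which is why it suppresses false positives without degrading the prompts' discriminative power.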

The overall training pipeline proceeds task by task: for the current task t, only the prompts for the new classes C_t are optimized while all previously learned prompts remain fixed. At inference, the full set of prompts for classes C₁:t is applied simultaneously, and AST calibrates each class’s confidence. Because each class has its own dedicated prompts, the multi‑label image is effectively decomposed into a set of binary classification problems, perfectly aligning with CLIP’s original single‑image‑text training objective.
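The task-by-task pipeline can be summarized in a schematic sketch. Prompts here stand in as opaque objects, the per-class decision uses the fact that a two-way softmax over (s⁺, s⁻) equals a sigmoid of their temperature-scaled difference, and all names are our own, not the paper's:

```python
import math

prompt_bank = {}  # class name -> frozen prompt parameters (schematic)

def learn_task(new_prompts):
    """Task t: store prompts for the new classes only. Entries already in
    the bank are left untouched -- replay-free knowledge anchors."""
    for cls, prompt in new_prompts.items():
        if cls not in prompt_bank:      # never overwrite earlier tasks
            prompt_bank[cls] = prompt

def infer(similarities, tau=1.0):
    """Apply the full prompt set at once: one independent binary decision
    per learned class. The binary softmax over (s_pos, s_neg) reduces to
    a sigmoid of their temperature-scaled difference."""
    return {cls: 1.0 / (1.0 + math.exp(-(s_pos - s_neg) / tau))
            for cls, (s_pos, s_neg) in similarities.items()}

learn_task({"dog": "prompt_dog_t1"})                       # task 1
learn_task({"dog": "overwrite?", "cat": "prompt_cat_t2"})  # task 2: "dog" stays frozen
preds = infer({"dog": (0.9, 0.1), "cat": (0.1, 0.9)})
```

The key invariant is that `learn_task` only ever adds entries, so earlier classes' prompts are bit-for-bit identical after any number of later tasks.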

Extensive experiments on MS‑COCO and PASCAL VOC demonstrate that DeCLIP consistently outperforms prior MLCIL methods, including replay‑based (PRS, OCDM), regularization‑based (KRT, CSC, HCP), and prompt‑based (MULTI‑LANE) approaches. Gains are observed across mAP and F1 metrics, with particularly strong reductions in false‑positive rates. Ablation studies confirm that (a) removing class‑specific prompts leads to severe semantic confusion and performance collapse, and (b) disabling AST causes a sharp rise in FPR. Parameter analysis shows that DeCLIP adds only a few dozen trainable tokens per class, resulting in negligible overhead compared to the frozen CLIP backbone and eliminating the need for memory buffers.

In summary, DeCLIP’s contributions are threefold: (1) a decoupled prompting strategy that maps each class to its own prompt space, thereby eliminating semantic interference and serving as replay‑free knowledge anchors; (2) an adaptive temperature‑based similarity scaling that robustly suppresses false positives without dataset‑specific tuning; and (3) a lightweight, scalable design that achieves state‑of‑the‑art performance on challenging multi‑label incremental benchmarks. Future work may explore richer prompt architectures, extension to hierarchical or relational labels, and deployment in real‑time streaming scenarios.

