Causality-guided Prompt Learning for Vision-language Models via Visual Granulation
Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most existing CLIP-based prompt learning methods show only a limited ability to handle fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique constructs sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.
💡 Research Summary
The paper introduces CaPL (Causality‑guided Prompt Learning), a novel method for adapting large pre‑trained vision‑language models such as CLIP to downstream recognition tasks, especially those requiring fine‑grained discrimination. Existing CLIP‑based prompt learning approaches either treat visual features holistically (global prompt learning) or focus on a fixed set of attributes (local prompt learning). Both strategies struggle to capture subtle inter‑class variations because they ignore that different attributes contribute unequally to recognition.
CaPL addresses this limitation with two tightly coupled modules.
- Attribute Disentanglement Module – Given an image, the frozen CLIP image encoder produces a visual feature vector (x_i). This vector is decomposed into two latent representations: a non‑individualized attribute vector (s_i) that encodes attributes shared across several classes, and an individualized attribute vector (d_i) that encodes class‑specific cues. The decomposition is learned via a Brownian Bridge Diffusion Model (BBDM). In the forward diffusion process, (x_i) is gradually transformed into (s_i); in the reverse process, conditioned on (d_i), the model reconstructs (x_i) from (s_i). Training minimizes the L2 distance between the reconstructed visual feature and the original, forcing (s_i) and (d_i) to capture complementary semantic information.
- Granule Learning Module – This module constructs “visual granules” that serve as supervision signals for the text prompt. For each of (K) learned query vectors (q_k) (each targeting a distinct individualized attribute), the model extracts a visual attribute representation (a^{d,i}_k) from (d_i) and a textual attribute representation (a^{p,c}_k) from the prompted text of class (c). Two causal interventions are then applied:
  - Factual Intervention – Each individualized attribute is combined with the full set of non‑individualized attributes to form a factual granule (x^{i}_k = D(s_i, a^{d,i}_k)), where (D) is an MLP decoder. The prompt is trained to (i) correctly identify which attribute generated each granule (via a softmax over cosine similarities) and (ii) correctly classify the original image by aggregating predictions over all granules. The loss (L_{\text{factual}}) is a weighted sum of the two cross‑entropy terms.
  - Counterfactual Intervention – To mitigate spurious correlations caused by homogeneous non‑individualized attributes, the model swaps non‑individualized attributes across images, creating counterfactual granules (\tilde{x}^{i}_k = D(s_j, a^{d,i}_k)), where (s_j) comes from a different image. A similar classification loss (L_{\text{counter}}) is applied.
- The overall training objective combines the disentanglement loss (L_A), the factual loss, and the counterfactual loss (scaled by a hyper‑parameter). After training, inference proceeds by encoding a test image with CLIP, encoding class names with the learned prompt, and selecting the class with highest cosine similarity.
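The Brownian Bridge forward process used for attribute disentanglement can be sketched in a few lines of numpy. The linear schedule m_t = t/T and the bridge variance 2(m_t − m_t²) follow the standard BBDM formulation and are assumptions here, not details stated in the summary; the feature dimensions are likewise illustrative.

```python
import numpy as np

def bb_forward(x0, y, t, T, rng):
    """Brownian Bridge forward step: interpolate from x0 toward y.

    x0 stands in for the visual feature (x_i); y stands in for the
    non-individualized attribute vector (s_i). The bridge is pinned at
    both endpoints: the sample equals x0 at t=0 and y at t=T.
    """
    m_t = t / T
    delta_t = 2.0 * (m_t - m_t ** 2)       # variance vanishes at both endpoints
    eps = rng.standard_normal(x0.shape)
    return (1.0 - m_t) * x0 + m_t * y + np.sqrt(delta_t) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(512)  # stand-in for a CLIP visual feature
y = rng.standard_normal(512)   # stand-in for the shared-attribute vector s_i

assert np.allclose(bb_forward(x0, y, 0, 1000, rng), x0)    # bridge starts at x0
assert np.allclose(bb_forward(x0, y, 1000, 1000, rng), y)  # and ends at y
```

The reverse process would run this bridge backwards conditioned on (d_i), with the L2 reconstruction loss closing the loop between the recovered feature and the original.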
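Granule construction under the two interventions can be illustrated with toy tensors. The additive `decode` function is only a stand-in for the paper's MLP decoder (D), and all names and dimensions are illustrative assumptions; the point is the data flow: factual granules pair (s_i) with its own individualized attributes, counterfactual granules swap in (s_j) from a different image, and attribute identification is a softmax over cosine similarities against the textual attribute representations.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def decode(s, a):
    # Toy stand-in for the MLP decoder D(s, a); the real model learns this map.
    return s + a

rng = np.random.default_rng(1)
K, d = 4, 64
s_i = rng.standard_normal(d)        # non-individualized attributes of image i
s_j = rng.standard_normal(d)        # non-individualized attributes of image j
a_d = rng.standard_normal((K, d))   # K individualized visual attributes of image i
a_p = rng.standard_normal((K, d))   # K textual attribute representations of class c

factual = np.stack([decode(s_i, a_d[k]) for k in range(K)])  # x^i_k
counter = np.stack([decode(s_j, a_d[k]) for k in range(K)])  # counterfactual granules

# Which attribute generated each factual granule? Softmax over cosine similarities.
probs = softmax(cosine(factual, a_p))
assert probs.shape == (K, K)
```

In training, cross-entropy on `probs` (and an analogous term for the counterfactual granules) would supervise the prompt, as described for L_factual and L_counter above.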
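Inference as described reduces to a cosine-similarity lookup between the CLIP image feature and the prompted class-name embeddings. A minimal sketch with random stand-in features (the dimensions and class count are illustrative assumptions):

```python
import numpy as np

def classify(image_feat, text_feats):
    """Return the index of the class whose prompted text embedding
    is most cosine-similar to the image feature."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

rng = np.random.default_rng(2)
text_feats = rng.standard_normal((10, 512))  # one embedding per prompted class name
image_feat = text_feats[3] + 0.01 * rng.standard_normal(512)  # image near class 3

assert classify(image_feat, text_feats) == 3
```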
Experimental validation is performed on 15 public datasets, including several fine‑grained benchmarks such as Flowers102, FGVC‑Aircraft, and CUB‑200‑2011. CaPL consistently outperforms state‑of‑the‑art prompt learning methods (CoOp, CoCoOp, ProDA, etc.), achieving improvements of 3–5 percentage points on average for fine‑grained tasks and notable gains across all datasets. Ablation studies reveal that (i) removing the attribute disentanglement degrades performance, confirming the necessity of separating shared and class‑specific cues; (ii) omitting non‑individualized attributes also harms accuracy, showing that even weakly discriminative attributes provide useful context; (iii) using only factual or only counterfactual interventions yields smaller gains than the combined strategy, highlighting the complementary role of both causal manipulations.
Key insights:
- Modeling the generation of visual features as a stochastic diffusion conditioned on class‑specific attributes enables a principled separation of shared vs. unique visual cues.
- Constructing visual granules through causal interventions supplies the prompt learner with explicit, attribute‑aware supervision, allowing the text prompt to encode fine‑grained distinctions that global methods miss.
- Counterfactual granules act as a regularizer that discourages the prompt from over‑relying on ubiquitous attributes, thereby improving generalization to novel attribute combinations.
In summary, CaPL introduces a causality‑driven framework that unifies diffusion‑based attribute disentanglement with factual and counterfactual granule construction, delivering a more discriminative and robust text prompt for CLIP. The approach opens avenues for extending causal prompt learning to other multimodal models, dynamic attribute selection, and downstream generation tasks.