ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding


Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.


💡 Research Summary

ExpAlign tackles the challenge of open‑vocabulary grounding, where a model must localize arbitrary textual concepts in images without explicit region‑level annotations. Existing CLIP‑style approaches collapse a prompt into a single global embedding, which hampers fine‑grained attribute binding, spatial relations, and negation handling. Conversely, token‑level methods often rely on heavy cross‑attention modules or require phrase‑level supervision, making them unsuitable for weakly supervised dense prediction.

The core contribution of ExpAlign is the Expectation Alignment Head (EAH). For each image feature map at scale $s$ and each textual prompt, the model computes token‑wise cosine similarities $S_s(x,y,l)$ between the visual feature at spatial location $(x,y)$ and the $l$‑th token embedding. These similarities are averaged over the spatial grid to obtain a global token score $\bar S_s(l)$. A softmax with temperature $\tau_t$ produces a posterior distribution $\pi(l)$ over tokens, weighting informative tokens more heavily while suppressing noisy ones. The final Expectation Alignment Map (EAM) is the expectation $\tilde S_s(x,y)=\sum_l \pi(l)\,S_s(x,y,l)$. This operation is mathematically equivalent to attention‑based soft pooling in Multiple Instance Learning (MIL): each spatial location is treated as an instance and each prompt as a bag, so token selection is learned implicitly without any token‑level labels.
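The EAH computation above can be sketched in a few lines of NumPy. This is a minimal illustration of the described math, not the paper's implementation; the names `visual_feats`, `token_embs`, and the default temperature `tau_t` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def expectation_alignment_map(visual_feats, token_embs, tau_t=0.07):
    """visual_feats: (H, W, D) features at one scale; token_embs: (L, D).
    Returns the EAM (H, W) and the token posterior pi (L,)."""
    v = visual_feats / np.linalg.norm(visual_feats, axis=-1, keepdims=True)
    t = token_embs / np.linalg.norm(token_embs, axis=-1, keepdims=True)
    S = np.einsum('hwd,ld->hwl', v, t)     # token-wise cosine similarities S(x,y,l)
    S_bar = S.mean(axis=(0, 1))            # global token scores, averaged over the grid
    pi = softmax(S_bar / tau_t)            # token posterior pi(l)
    eam = np.einsum('hwl,l->hw', S, pi)    # expectation over tokens
    return eam, pi
```

Because the EAM is a convex combination of cosine similarities, each entry stays in $[-1, 1]$; the soft pooling over tokens is what makes this an attention-style MIL operator rather than a plain average.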

To stabilize learning and enforce cross‑scale coherence, ExpAlign introduces a two‑pronged consistency regularization module. First, semantic consistency aggregates the EAMs from all scales by down‑sampling them to the coarsest resolution (P5) and summing them. From the unified map, the top 1% of responses are selected, and their average score $\ell$ serves as a prompt‑level logit. A multi‑positive InfoNCE loss (Top‑K InfoNCE) treats these logits as positives and all other prompts for the same image as negatives, encouraging distinct prompt‑region alignments. Second, the Geometry‑Aware Consistency Objective (GACO) builds a high‑resolution unified map (via top‑down up‑sampling) and applies a softmax over both prompts and spatial locations to obtain a joint distribution $P(p,i)$. Within each ground‑truth mask, the method computes the sigmoid‑transformed alignment confidence $R(p,i)$, its mean $\mu$ and standard deviation $\sigma$, and a clipped relative consistency score $A(p,i)=\mathrm{clip}\big((R(p,i)-\mu)/\sigma,\,-c,\,c\big)$. The GACO loss maximizes $A(p,i)\log P(p,i)$ averaged over all positive pixels, redistributing probability mass so that patches belonging to the same instance exhibit coherent geometry without imposing absolute spatial targets. Both losses are derived from an energy‑based free‑energy minimization perspective, providing a principled regularization framework.
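The three ingredients of this module can be sketched as standalone NumPy functions. This is a hedged reading of the description above, not the paper's code: the function names, the `tau` temperature in the InfoNCE term, and the small numerical-stability constants are assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def topk_prompt_logit(unified_map, frac=0.01):
    """Average of the top-1% responses of the unified coarse map -> prompt logit."""
    flat = np.sort(unified_map.ravel())[::-1]
    k = max(1, int(np.ceil(frac * flat.size)))
    return flat[:k].mean()

def multi_positive_infonce(logits, pos_mask, tau=0.1):
    """logits: (P,) prompt-level logits for one image; pos_mask: (P,) bool positives.
    Multi-positive InfoNCE: average negative log-probability of the positives."""
    logp = log_softmax(logits / tau)
    return -logp[pos_mask].mean()

def gaco_loss(R, logP, c=2.0):
    """R: (N,) sigmoid alignment confidences inside one ground-truth mask;
    logP: (N,) log joint probabilities P(p,i) for the same pixels."""
    mu, sigma = R.mean(), R.std() + 1e-6
    A = np.clip((R - mu) / sigma, -c, c)   # clipped relative consistency score
    return -(A * logP).mean()              # maximize A(p,i) * log P(p,i)
```

Note the sign convention in `gaco_loss`: pixels whose confidence sits above the in-mask mean ($A>0$) pull probability mass toward themselves, while below-mean pixels push it away, which is the "relative, not absolute, spatial target" behaviour described above.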

The overall training objective combines the task‑specific detection/segmentation loss $\mathcal{L}_{det/seg}$ with the two consistency terms:
$\mathcal{L} = \mathcal{L}_{det/seg} + \lambda_{sem}\,\mathcal{L}_{sem} + \lambda_{geo}\,\mathcal{L}_{geo}$.
During inference, the expectation head and consistency modules are removed, preserving the standard detection/segmentation pipeline and incurring no extra computational overhead.
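As a minimal sketch, the combined objective is a plain weighted sum; the weights `lam_sem` and `lam_geo` below are hypothetical placeholders (the paper's values are not given in this summary).

```python
def total_loss(l_det_seg, l_sem, l_geo, lam_sem=0.5, lam_geo=0.25):
    """Combined training objective: task loss plus weighted consistency terms.
    Both regularizers are used only during training and dropped at inference."""
    return l_det_seg + lam_sem * l_sem + lam_geo * l_geo
```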

Experiments are conducted with a frozen DINOv3‑ConvNeXt‑T visual encoder and evaluated on LVIS (minival), ODinW, and RefCOCO/+/g. ExpAlign achieves 36.2 AP$_r$ on LVIS, surpassing state‑of‑the‑art methods of comparable scale, and shows pronounced gains on rare categories and long‑tail vocabularies. Ablation studies reveal that (i) removing token posterior weighting (i.e., using uniform averaging) drops AP by roughly 2–3 points, (ii) omitting GACO while keeping Top‑K InfoNCE reduces spatial coherence, and (iii) both regularizers together yield the best trade‑off between semantic discrimination and geometric consistency. Moreover, ExpAlign remains lightweight: it adds only a few million parameters and does not increase inference latency, unlike cross‑attention‑heavy baselines.

In summary, ExpAlign presents a theoretically grounded, MIL‑inspired token‑level alignment mechanism coupled with multi‑scale semantic and geometry‑aware regularization. This design enables fine‑grained, weakly supervised vision‑language grounding while maintaining efficiency, marking a significant step forward for open‑vocabulary detection and segmentation.

