Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning
Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result is a negative expressivity result: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes-optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.
💡 Research Summary
The paper tackles a fundamental gap in the theory of in-context learning (ICL): while recent work has explained how transformer-style models can learn to perform Bayes-optimal prediction from a prompt of examples in unimodal settings, little is known about the multi-modal case, where each example consists of several heterogeneous modalities (e.g., image and text). To fill this gap the authors introduce a mathematically tractable latent-factor model for multi-modal data. For each prompt $j$ and each example $i$, the two modality vectors $\bar z^{(j)}_i \in \mathbb{R}^{d_1}$ and $\tilde z^{(j)}_i \in \mathbb{R}^{d_2}$ are generated as noisy linear projections of a shared scalar latent variable $u^{(j)}_i$. The response $y^{(j)}_i$ is a noisy linear function of the same latent variable with a task-specific coefficient $\zeta^{(j)}$. Consequently, the joint distribution of covariates and response is Gaussian with a covariance matrix $\Lambda^{(j)} = I + m^{(j)} m^{(j)\top}$, where $m^{(j)} =$
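The rank-one covariance structure above can be checked numerically. The sketch below samples examples from the generative model for a single fixed task: a scalar latent $u_i$ is drawn, the stacked vector of both modalities and the response is $m u_i$ plus standard Gaussian noise, and the empirical covariance is compared against $\Lambda = I + m m^\top$. The dimensions and the particular vector `m` are illustrative assumptions (the summary's exact definition of $m^{(j)}$ is truncated), and unit-variance noise is assumed throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality dimensions (not specified in the summary).
d1, d2 = 3, 2
d = d1 + d2 + 1  # modality 1 + modality 2 + scalar response y

# m stacks the two modality projection directions and the task
# coefficient zeta; this particular m is an illustrative assumption.
m = rng.normal(size=d)

# One fixed task (fixed m), many examples i within the prompt.
n = 200_000
u = rng.normal(size=n)                         # shared scalar latent u_i
noise = rng.normal(size=(n, d))                # unit-variance Gaussian noise
x = np.outer(u, m) + noise                     # rows: (z_bar_i, z_tilde_i, y_i)

emp_cov = (x.T @ x) / n                        # empirical covariance (mean is 0)
theory_cov = np.eye(d) + np.outer(m, m)        # Lambda = I + m m^T

print(np.abs(emp_cov - theory_cov).max())      # shrinks as n grows
```

With $n$ this large, the entrywise gap between the empirical and theoretical covariance is on the order of $10^{-2}$, consistent with the claimed Gaussian joint distribution.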