LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States
Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, even matching or surpassing the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the per-layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
💡 Research Summary
This paper investigates a fundamental limitation of using the final‑layer hidden states of decoder‑only large language models (LLMs) as sentence embeddings. Because LLMs are trained solely for next‑token prediction, the last hidden state of a token is optimized to discriminate the correct next token rather than to capture the overall meaning of the preceding sentence. Consequently, sentences that share similar local continuations can produce nearly identical hidden representations, even when their global semantics differ.
To address this, the authors adopt a truth‑conditional view of meaning: a sentence’s meaning can be approximated by the distribution over its possible continuations. They argue that the value vectors produced by the self‑attention mechanism directly encode the information needed to generate those continuations, and therefore are a more faithful proxy for sentence semantics than hidden states.
The core contribution is Value Aggregation (VA), a training‑free method that extracts the value vectors v_{l,h,n} from each attention head h at every layer l, concatenates the heads to form a d‑dimensional token vector v_{l,n}, and then averages over tokens to obtain a layer‑wise embedding \hat{v}_{l}. A set of layers S is selected (based on empirical performance on a challenging retrieval task), and the final sentence embedding is the mean of the selected layer embeddings: V_agg = (1/|S|) Σ_{l∈S} \hat{v}_{l}. This procedure requires only a forward pass, no additional prompts, and incurs negligible computational overhead.
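The pooling arithmetic above can be sketched in NumPy, assuming the per-layer value tensors have already been extracted from a forward pass (for example via hooks on the attention modules); the function name and toy shapes here are illustrative, not the authors' implementation:

```python
import numpy as np

def value_aggregation(values, layer_set):
    """Training-free VA pooling: values[l] holds the value vectors
    v_{l,h,n} for layer l with shape (heads, tokens, head_dim)."""
    layer_embs = []
    for l in layer_set:
        v = values[l]                              # (H, N, d_head)
        H, N, d_head = v.shape
        # Concatenate heads -> token vectors v_{l,n} of dimension d = H * d_head
        tokens = v.transpose(1, 0, 2).reshape(N, H * d_head)
        # Average over token indices -> layer embedding \hat{v}_l
        layer_embs.append(tokens.mean(axis=0))
    # Mean over the selected layers S -> V_agg
    return np.mean(layer_embs, axis=0)

# Toy example with random "value" tensors (H=4 heads, N=6 tokens, d_head=8)
rng = np.random.default_rng(0)
vals = {l: rng.normal(size=(4, 6, 8)) for l in range(32)}
emb = value_aggregation(vals, layer_set=[20, 21, 22])
print(emb.shape)  # (32,)
```

With real model values, `layer_set` would be the empirically chosen S (e.g. layers 20‑27 for LLaMA‑2 in the paper), and the only cost beyond the standard forward pass is this pooling.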
Layer selection experiments on two backbones, LLaMA‑2 (7B) and Qwen‑3 (8B), show that the best single‑layer hidden‑state embeddings appear in early layers, while VA's performance improves monotonically with depth and consistently outperforms hidden‑state pooling in deeper layers. The authors therefore define default layer sets: layers 20‑27 for LLaMA‑2 and layers 26, 27, 29, 30, and 31 for Qwen‑3.
Extensive evaluation on the Massive Text Embedding Benchmark (MTEB) demonstrates that VA surpasses all existing training‑free LLM embedding methods (e.g., L_T, WMP, HS‑Full) and even exceeds the ensemble‑based Explicit One‑Word Limitation (MetaEOL) approach, despite requiring far fewer forward passes.
Building on this insight, the paper introduces Aligned Weighted Value Aggregation (AlignedWVA). By prompting the model so that the last token's attention scores serve as weights, and by applying the output projection matrix W_O, the weighted value vectors are aligned to the common residual‑stream space of the model. This yields training‑free embeddings that achieve state‑of‑the‑art performance among such methods, beating MetaEOL by a substantial margin (over 30% relative gain on several tasks).
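The AlignedWVA construction can be sketched as follows, again assuming the value vectors, the last token's attention scores, and each layer's W_O matrix have been extracted beforehand; all names and shapes are illustrative placeholders, not the paper's code:

```python
import numpy as np

def aligned_weighted_va(values, last_attn, W_O, layer_set):
    """AlignedWVA sketch.

    values[l]    : (H, N, d_head) value vectors at layer l
    last_attn[l] : (H, N) attention scores of the last token at layer l
    W_O[l]       : (H * d_head, d_model) output projection at layer l
    """
    layer_embs = []
    for l in layer_set:
        v, a = values[l], last_attn[l]
        # Weight each head's values by the last token's attention scores:
        # sum_n a_{h,n} * v_{h,n} -> (H, d_head)
        weighted = np.einsum('hn,hnd->hd', a, v)
        # Concatenate heads, then align into the residual-stream space via W_O
        layer_embs.append(weighted.reshape(-1) @ W_O[l])
    return np.mean(layer_embs, axis=0)

# Toy example: H=4 heads, N=6 tokens, d_head=8, d_model=32
rng = np.random.default_rng(1)
vals = {l: rng.normal(size=(4, 6, 8)) for l in range(32)}
attn = {l: rng.dirichlet(np.ones(6), size=4) for l in range(32)}  # rows sum to 1
Wo = {l: rng.normal(size=(32, 32)) for l in range(32)}
emb = aligned_weighted_va(vals, attn, Wo, layer_set=[26, 27, 29])
print(emb.shape)  # (32,)
```

Note that the per-layer quantity computed here coincides with the attention sub-layer's output for the last token (before the residual addition), which is why the paper describes these embeddings as living in the model's common residual-stream space.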
Finally, the authors explore a lightweight fine‑tuning regime, Fine‑tune‑VA, in which only a small subset of parameters (primarily in the attention layers) is updated. Fine‑tune‑VA matches or outperforms fully fine‑tuned hidden‑state baselines while using less than 10% of the trainable parameters, highlighting the efficiency of the VA representation.
In summary, the paper makes four key contributions: (1) a theoretical and empirical demonstration that attention value vectors capture sentence semantics more directly than hidden states; (2) the simple, training‑free Value Aggregation method; (3) the Aligned Weighted VA technique that sets a new performance ceiling for training‑free LLM embeddings; and (4) evidence that VA can be efficiently fine‑tuned to produce high‑quality, parameter‑efficient embedding models. The work opens avenues for applying VA to larger models, multimodal data, and large‑scale retrieval systems.