Beyond Cosine Similarity
Cosine similarity, the standard metric for measuring semantic similarity in vector spaces, is mathematically grounded in the Cauchy-Schwarz inequality, which inherently limits it to capturing linear relationships, a constraint that fails to model the complex, nonlinear structures of real-world semantic spaces. We advance this theoretical underpinning by deriving a tighter upper bound on the dot product than the classical Cauchy-Schwarz bound. This new bound leads directly to recos, a similarity metric that normalizes the dot product by the dot product of the sorted vector components. recos relaxes the condition for perfect similarity from strict linear dependence to ordinal concordance, thereby capturing a broader class of relationships. Extensive experiments across 11 embedding models, spanning static, contextualized, and universal types, demonstrate that recos consistently outperforms traditional cosine similarity, achieving higher correlation with human judgments on standard Semantic Textual Similarity (STS) benchmarks. Our work establishes recos as a mathematically principled and empirically superior alternative, offering enhanced accuracy for semantic analysis in complex embedding spaces.
💡 Research Summary
The paper revisits the mathematical foundation of similarity measurement in vector spaces, arguing that the ubiquitous cosine similarity is limited by its reliance on the Cauchy‑Schwarz inequality. Because cosine normalizes the dot product by the product of Euclidean norms, it reaches its maximum value of 1 only when one vector is a positive scalar multiple of the other (linear dependence with matching sign). The authors point out that this strict condition is rarely satisfied in real‑world semantic embeddings, where meaningful relationships often manifest as consistent ordinal patterns rather than exact proportionality.
To address this limitation, the authors derive a tighter upper bound for the dot product using the Rearrangement Inequality. For any vectors u and v in ℝᵈ, they prove
|u·v| ≤ ⟦u↑·v↕⟧ ≤ ‖u‖‖v‖ ≤ ½(‖u‖² + ‖v‖²),
where u↑ denotes u sorted in non‑decreasing order and v↕ denotes v sorted either non‑decreasing or non‑increasing depending on the sign of u·v. The term ⟦u↑·v↕⟧ is the tightest possible bound and becomes an equality precisely when the two vectors are similar in the sense of ordinal concordance (all component pairs preserve the same order) or discordant when the dot product is negative.
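Under one natural reading of the notation (⟦u↑·v↕⟧ as the dot product after sorting u in non‑decreasing order and sorting v in the direction matching the sign of u·v), the whole chain of bounds can be spot‑checked numerically. The following NumPy sketch is our own construction, not code from the paper:

```python
import numpy as np

# Spot-check |u·v| <= ⟦u↑·v↕⟧ <= ||u|| ||v|| <= (||u||^2 + ||v||^2) / 2
# on random vectors (tolerances absorb floating-point error).
rng = np.random.default_rng(42)
for _ in range(1000):
    u, v = rng.standard_normal(8), rng.standard_normal(8)
    d = float(u @ v)
    us, vs = np.sort(u), np.sort(v)
    # Rearrangement inequality: same-order sorting maximizes the dot product,
    # opposite-order sorting minimizes it; the sign of u·v picks the relevant bound.
    rearr = float(us @ vs) if d >= 0 else abs(float(us @ vs[::-1]))
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    assert abs(d) <= rearr + 1e-9                     # |u·v| <= ⟦u↑·v↕⟧
    assert rearr <= nu * nv + 1e-9                    # Cauchy-Schwarz on sorted vectors
    assert nu * nv <= 0.5 * (nu**2 + nv**2) + 1e-9    # AM-GM
```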
From this hierarchy of bounds they define three similarity measures:
- recos (Rearrangement‑inequality‑based Cosine): recos(u,v) = (u·v) / ⟦u↑·v↕⟧. It attains 1 whenever the vectors share the same ranking of components, regardless of their absolute magnitudes.
- cos (standard cosine similarity): cos(u,v) = (u·v) / (‖u‖‖v‖). It reaches 1 only when one vector is a positive scalar multiple of the other.
- decos (Degenerated Cosine): decos(u,v) = (u·v) / (½(‖u‖² + ‖v‖²)). It equals 1 only when the vectors are identical, and −1 only when they are exact opposites.
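As a sketch, the three measures could be implemented as follows. The function names (`recos`, `cos_sim`, `decos`) and the sign handling for v↕ are our assumptions, not code released with the paper:

```python
import numpy as np

def recos(u, v):
    """Rearrangement-inequality-based cosine: (u·v) / ⟦u↑·v↕⟧ (our reading)."""
    d = float(np.dot(u, v))
    us, vs = np.sort(u), np.sort(v)
    # Same-order sorting maximizes the dot product (rearrangement inequality);
    # for a negative u·v, the opposite-order minimum bounds its magnitude.
    bound = float(np.dot(us, vs)) if d >= 0 else abs(float(np.dot(us, vs[::-1])))
    return d / bound if bound != 0 else 0.0

def cos_sim(u, v):
    """Standard cosine similarity."""
    return float(np.dot(u, v)) / float(np.linalg.norm(u) * np.linalg.norm(v))

def decos(u, v):
    """Degenerated cosine: dot product over the mean of the squared norms."""
    return float(np.dot(u, v)) / (0.5 * float(np.dot(u, u) + np.dot(v, v)))
```

For example, with u = [1, 2, 3] and v = u² = [1, 4, 9], `recos(u, v)` returns 1.0 because the component ranking is preserved, while `cos_sim(u, v)` stays below 1.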
The authors prove a series of corollaries establishing the ordering |decos| ≤ |cos| ≤ |recos| and detailing the precise conditions under which any two of the measures coincide. Notably, when vectors are unit‑norm (the common preprocessing step for embeddings), decos and cos become mathematically identical, highlighting that both discard magnitude information. In contrast, recos remains distinct because its denominator depends on the sorted component values, preserving ordinal information even after normalization.
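The unit‑norm corollary is easy to confirm empirically. In the sketch below (our construction, not the paper's), cos and decos coincide exactly on unit‑normalized random vectors, while recos retains a different, sorted‑component denominator:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal(300), rng.standard_normal(300)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # unit-norm preprocessing

d = float(u @ v)
cos_val = d / float(np.linalg.norm(u) * np.linalg.norm(v))  # norms are 1
decos_val = d / (0.5 * float(u @ u + v @ v))                # denominator = 0.5*(1+1) = 1
assert abs(cos_val - decos_val) < 1e-9  # identical on unit vectors

# recos divides by the sorted-component dot product instead
us, vs = np.sort(u), np.sort(v)
bound = float(us @ vs) if d >= 0 else abs(float(us @ vs[::-1]))
recos_val = d / bound
assert recos_val != cos_val  # still distinct after normalization
```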
An illustrative low‑dimensional example demonstrates how each metric reacts to three types of relationships: (i) perfect linear scaling, (ii) genuine dissimilarity, and (iii) non‑linear but order‑preserving similarity. recos uniquely assigns a maximal score to the order‑preserving case, whereas cos fails to differentiate it from genuine dissimilarity, and decos cannot separate linear scaling from outright disagreement.
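The qualitative pattern can be reproduced with toy vectors of our own choosing (not the paper's exact example):

```python
import numpy as np

def rearr_bound(u, v):
    # ⟦u↑·v↕⟧: sorted-components dot product, order matched to the sign of u·v
    d = float(u @ v)
    us, vs = np.sort(u), np.sort(v)
    return float(us @ vs) if d >= 0 else abs(float(us @ vs[::-1]))

u = np.array([1.0, 2.0, 3.0])
cases = {
    "linear scaling (v = 2u)":     np.array([2.0, 4.0, 6.0]),
    "dissimilar":                  np.array([3.0, -1.0, 0.0]),
    "order-preserving (v = u**2)": np.array([1.0, 4.0, 9.0]),
}
for name, v in cases.items():
    d = float(u @ v)
    r = d / rearr_bound(u, v)
    c = d / float(np.linalg.norm(u) * np.linalg.norm(v))
    de = d / (0.5 * float(u @ u + v @ v))
    print(f"{name:30s} recos={r:.3f} cos={c:.3f} decos={de:.3f}")
# recos scores 1.000 for both the linear and the order-preserving case,
# cos scores 1.000 only for the linear case, and decos for neither.
```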
Computationally, recos requires sorting each vector, yielding O(d log d) time versus O(d) for cosine. The authors argue that for typical embedding dimensions (hundreds to a few thousand) the overhead is modest and can be mitigated with batch sorting on GPUs.
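A batched NumPy sketch (ours, illustrating the cost profile rather than the authors' implementation) shows where the O(d log d) term enters:

```python
import numpy as np

def recos_batch(U, V):
    """recos for row-aligned float batches U, V of shape (n, d); vectorized sketch.

    np.sort along the last axis dominates at O(n * d log d);
    the remaining reductions are O(n * d).
    """
    d = np.einsum("nd,nd->n", U, V)
    Us, Vs = np.sort(U, axis=1), np.sort(V, axis=1)
    same = np.einsum("nd,nd->n", Us, Vs)                  # maximal rearranged dot
    opp = np.abs(np.einsum("nd,nd->n", Us, Vs[:, ::-1]))  # |minimal| rearranged dot
    bound = np.where(d >= 0, same, opp)
    # Safe division: return 0 where the bound vanishes (zero vectors)
    return np.divide(d, bound, out=np.zeros_like(d), where=bound != 0)
```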
The empirical evaluation spans eleven embedding families, including static models (Word2Vec, GloVe, FastText), contextual models (ELMo, BERT, RoBERTa, Sentence‑BERT), and multimodal models (CLIP). The authors benchmark against seven standard Semantic Textual Similarity (STS) datasets (STS‑12 through STS‑16, STS‑B, and SICK‑R). For each dataset they compute Pearson and Spearman correlations between human similarity judgments and each of the three similarity scores. Across the board, recos outperforms cosine, achieving average gains of 2–4 percentage points in Pearson correlation. The improvement is most pronounced for contextual embeddings and for high‑frequency tokens, where cosine is known to suffer from norm‑inflation bias. Statistical significance is confirmed via Fisher’s z‑test (p < 0.01).
The paper also discusses limitations. Because recos hinges on the exact ordering of components, it can be sensitive to small perturbations that flip a pairwise order, especially in high‑dimensional noisy embeddings. The authors acknowledge the need for robustness analyses (e.g., adding Gaussian noise, evaluating rank stability) and for exploring whether sparsity (as in TF‑IDF vectors) affects the usefulness of the sorted‑based denominator. Moreover, while the authors claim that recos retains discriminative power after unit‑norm normalization, they do not provide a detailed ablation on the impact of different preprocessing pipelines.
In conclusion, the work provides a mathematically rigorous alternative to cosine similarity that broadens the notion of “perfect similarity” from strict linear dependence to ordinal concordance. The theoretical hierarchy, clear saturation conditions, and extensive empirical validation make a compelling case for adopting recos in tasks where monotonic, non‑linear relationships between embedding dimensions carry semantic meaning. Future research should address robustness to noise, scalability to extremely large corpora, and integration with downstream tasks such as retrieval‑augmented generation or cross‑modal alignment, where the ordinal signal captured by recos may complement angular information traditionally used by cosine similarity.