Metric and Kernel Learning using a Linear Transformation
Metric and kernel learning are important in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study metric learning as a problem of learning a linear transformation of the input data. We show that for high-dimensional data, a particular framework for learning a linear transformation of the data based on the LogDet divergence can be efficiently kernelized to learn a metric (or equivalently, a kernel function) over an arbitrarily high dimensional space. We further demonstrate that a wide class of convex loss functions for learning linear transformations can similarly be kernelized, thereby considerably expanding the potential applications of metric learning. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision and text mining.
💡 Research Summary
The paper addresses two fundamental problems in machine learning: learning a distance (metric) function and learning a kernel function. Traditional Mahalanobis‑based metric learning suffers from quadratic growth in the number of parameters with data dimensionality and cannot capture non‑linear decision boundaries. Existing kernel‑learning methods often operate in a transductive setting, meaning the learned kernel cannot be applied to unseen points.
The authors propose to view metric learning as the problem of learning a linear transformation, or equivalently a positive‑definite matrix W, that parameterizes the squared Mahalanobis distance d_W(x_i, x_j) = (x_i − x_j)^T W (x_i − x_j). They regularize with the LogDet divergence to the identity, D_{LD}(W, I) = tr(W) − log det W − d, minimized subject to pairwise distance constraints. LogDet is a Bregman matrix divergence on the cone of positive‑definite matrices; it is finite only when W is positive definite (so positive‑definiteness is enforced automatically), enjoys scale‑invariance, and preserves the range space, making the optimization well‑behaved.
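The two quantities above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the function names are hypothetical:

```python
import numpy as np

def mahalanobis_sq(x, y, W):
    """Squared Mahalanobis distance d_W(x, y) = (x - y)^T W (x - y)."""
    diff = x - y
    return float(diff @ W @ diff)

def logdet_div_to_identity(W):
    """LogDet divergence D_LD(W, I) = tr(W) - log det(W) - d."""
    d = W.shape[0]
    sign, logdet = np.linalg.slogdet(W)
    assert sign > 0, "W must be positive definite"
    return float(np.trace(W) - logdet - d)
```

Note that `logdet_div_to_identity(np.eye(d))` is exactly 0: the divergence vanishes only when W equals the prior, which is what makes it a natural regularizer toward the identity.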
To handle non‑linear data, the authors kernelize the formulation. By mapping each input x_i to a (possibly infinite‑dimensional) feature space via φ(·), the distance becomes d_W(φ(x_i), φ(x_j)), and the corresponding kernel is κ(x, y) = φ(x)^T W φ(y). The key theoretical contribution (Theorem 3.1) shows that the optimal W* for the original problem and the optimal kernel matrix K* for the kernel problem are linked by K* = X^T W* X and W* = I + X M X^T, where X is the matrix whose columns are the embedded inputs φ(x_i) and M is an n × n matrix. Because W* has this form, the learned kernel can be evaluated on new points using input kernel values alone, so the method generalizes beyond the transductive setting.
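The stated relationship can be checked numerically: for any W of the form I + X M X^T, the induced kernel matrix X^T W X equals K0 + K0 M K0, where K0 = X^T X is the input Gram matrix, so the learned kernel depends on the data only through inner products. A minimal sketch (the sizes and random matrices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 4                        # feature dimension, number of points
X = rng.standard_normal((d, n))    # columns are (embedded) input points
M = rng.standard_normal((n, n))
M = M + M.T                        # any symmetric n x n matrix

W = np.eye(d) + X @ M @ X.T        # W = I + X M X^T
K0 = X.T @ X                       # input Gram (kernel) matrix
K = X.T @ W @ X                    # induced kernel K = X^T W X

# K is expressible purely through K0, i.e. through inner products:
assert np.allclose(K, K0 + K0 @ M @ K0)
```

This identity is what lets the learned metric extend to unseen points: evaluating κ on new inputs only requires their kernel values against the training set, never the (possibly infinite‑dimensional) φ explicitly.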