Transforming Neural Network Visual Representations to Predict Human Judgments of Similarity
Deep-learning vision models have shown intriguing similarities and differences with respect to human vision. We investigate how to bring machine visual representations into better alignment with human representations. Human representations are often inferred from behavioral evidence such as the selection of an image most similar to a query image. We find that with appropriate linear transformations of deep embeddings, we can improve prediction of human binary choice on a data set of bird images from 72% at baseline to 89%. We hypothesized that deep embeddings have redundant, high-dimensional (4096-D) representations; however, reducing the rank of these representations results in a loss of explanatory power. We hypothesized that the dilation transformation of representations explored in past research is too restrictive, and indeed we found that model explanatory power can be significantly improved with a more expressive linear transform. Most surprising and exciting, we found that, consistent with classic psychological literature, human similarity judgments are asymmetric: the similarity of X to Y is not necessarily equal to the similarity of Y to X, and allowing models to express this asymmetry improves explanatory power.
💡 Research Summary
The paper investigates how to bring deep‑learning visual representations into closer alignment with human similarity judgments. Using a pre‑trained VGG‑16 network, the authors extract 4096‑dimensional embeddings from the penultimate layer for a set of bird images. Human data consist of 112,784 triplet‑inequality constraints (TICs) collected via Amazon Mechanical Turk, where participants view a query image and two reference images and choose the more similar reference.
To model these judgments, the authors propose a similarity function ŝ_qr = f(z_q)^T W f(z_r), where f performs dimensionality reduction via principal component analysis (PCA) onto the top k components (k varies from 2 to 4096) and W is a learned k × k weight matrix. The core of the study is the exploration of different constraints on W: (1) Identity (W = I) as a baseline, (2) Diagonal with non-negative entries (W = diag(|v|)), (3) Symmetric (W = V^T V), allowing an arbitrary symmetric linear transform, and (4) Unconstrained (a full free matrix).
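The four parameterizations can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the authors' code: the dimensions are shrunk (the paper uses 4096-D VGG-16 penultimate-layer activations), the embeddings are random, and `similarity` just implements ŝ_qr = f(z_q)^T W f(z_r) with a PCA-based f.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: the paper uses 4096-D VGG-16 penultimate-layer
# activations for bird images; dimensions are shrunk here for illustration.
n, D, k = 20, 64, 8
Z = rng.normal(size=(n, D))          # n images, D-dimensional embeddings

# f: center and project onto the top-k principal components (PCA via SVD).
mean = Z.mean(axis=0)
_, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
def f(z):
    return (z - mean) @ Vt[:k].T

def similarity(zq, zr, W):
    # s_qr = f(z_q)^T W f(z_r)
    return f(zq) @ W @ f(zr)

# The four parameterizations of W described above:
W_identity = np.eye(k)                            # (1) baseline
W_diagonal = np.diag(np.abs(rng.normal(size=k)))  # (2) non-negative scaling
V = rng.normal(size=(k, k))
W_symmetric = V.T @ V                             # (3) symmetric transform
W_free = rng.normal(size=(k, k))                  # (4) unconstrained

# A symmetric W forces s(q, r) == s(r, q); an unconstrained W does not,
# which is what lets the model capture asymmetric human judgments.
zq, zr = Z[0], Z[1]
assert np.isclose(similarity(zq, zr, W_symmetric), similarity(zr, zq, W_symmetric))
```

Note that models (1)–(3) are symmetric by construction (x^T W y = y^T W x whenever W = W^T), so only the unconstrained variant can express the asymmetry discussed below.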
Human choice probabilities are modeled with a logistic function of the difference in similarity scores, and model parameters are learned by maximizing the log‑likelihood of the observed choices using stochastic gradient descent with Nesterov momentum. Five‑fold cross‑validation provides training and validation accuracies; early stopping prevents over‑fitting.
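The choice model and optimizer can be sketched as follows. This is a minimal, assumption-laden sketch on synthetic data, not the authors' pipeline: it uses full-batch updates rather than true stochastic gradients, omits cross-validation and early stopping, and the triplet arrays `Xq`, `Xa`, `Xb` are random stand-ins for PCA-reduced embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 8, 500

# Synthetic PCA-reduced embeddings for n triplets (query, chosen, rejected);
# random stand-ins for the paper's bird-image data.
Xq = rng.normal(size=(n, k))
Xa = rng.normal(size=(n, k))   # reference the participant chose
Xb = rng.normal(size=(n, k))   # reference the participant rejected

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nll(W):
    # P(choose a) = sigmoid(s_qa - s_qb), with s_qr = f(z_q)^T W f(z_r)
    d = np.einsum('ni,ij,nj->n', Xq, W, Xa) - np.einsum('ni,ij,nj->n', Xq, W, Xb)
    return -np.log(sigmoid(d) + 1e-12).mean()

# Gradient descent with Nesterov momentum on an unconstrained W.
W = np.eye(k)
velocity = np.zeros_like(W)
lr, mu = 0.05, 0.9
for _ in range(200):
    look = W + mu * velocity                     # Nesterov lookahead point
    d = (np.einsum('ni,ij,nj->n', Xq, look, Xa)
         - np.einsum('ni,ij,nj->n', Xq, look, Xb))
    p = sigmoid(d)
    # Per-triplet loss -log sigmoid(d) has gradient -(1 - p) * outer(q, a - b).
    grad = -np.einsum('n,ni,nj->ij', 1.0 - p, Xq, Xa - Xb) / n
    velocity = mu * velocity - lr * grad
    W = W + velocity
```

Since the loss is a logistic function of a quantity linear in W, the negative log-likelihood is convex in W, and `nll(W)` after training should be lower than `nll(np.eye(k))` at initialization.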
Results show that the baseline identity model achieves only 67.8% accuracy. Introducing a diagonal scaling improves performance to about 78%, replicating earlier findings that simple “dilation” of deep features helps. Allowing a full symmetric linear transform yields further gains, reaching 89–90% validation accuracy at k = 4096. The most expressive unconstrained model performs comparably, indicating that the additional parameters do not cause severe over-fitting. Crucially, relaxing the symmetry constraint, so that the similarity of A to B can differ from that of B to A, gives the unconstrained model a modest but consistent accuracy advantage over the symmetric one. This empirically confirms long-standing psychological evidence that human similarity judgments are asymmetric.
Generalization to unseen images was tested by holding out entire images rather than just triplets. Accuracy drops slightly but the ranking of model variants remains unchanged, demonstrating that the learned linear transform transfers to new visual instances.
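The image-level holdout can be sketched with synthetic triplets of image indices. This is a hypothetical illustration of the splitting criterion only (random indices, made-up sizes), not the paper's actual data handling: any triplet that touches a held-out image is routed to validation, so validation accuracy measures transfer to entirely unseen images rather than merely unseen triplets.

```python
import numpy as np

rng = np.random.default_rng(2)
n_images, n_triplets = 50, 1000

# Hypothetical triplets of image indices (query, ref_a, ref_b).
triplets = rng.integers(0, n_images, size=(n_triplets, 3))

# Hold out whole images: a triplet containing any held-out image
# goes to validation, everything else to training.
held_out = set(rng.choice(n_images, size=10, replace=False).tolist())
is_val = np.array([any(int(i) in held_out for i in t) for t in triplets])
train, val = triplets[~is_val], triplets[is_val]

# No training triplet ever sees a held-out image.
assert not any(int(i) in held_out for t in train for i in t)
```

This contrasts with triplet-level cross-validation, where the same images can appear in both training and validation folds.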
The study contributes three main insights: (1) Deep visual embeddings can be linearly re‑parameterized to match human psychological embeddings with high fidelity; (2) richer linear transforms (beyond simple scaling) improve fit without catastrophic over‑fitting; (3) modeling asymmetry in similarity is essential for capturing human judgments. Limitations include the exclusive focus on linear mappings, the absence of direct neural (e.g., fMRI) validation, and the restriction to a single domain (bird images). Future work could explore non‑linear transformations, integrate brain activity data, and test the approach on broader visual categories.